# Models to Try!

## Inverse-multiquadric Hamming (IMQ-H) kernel

$$
k_{\text{IMQ-H}}(X,Y) = \frac{1}{\left(1+d_{H}^{\Phi}(X,Y)\right)^2} = \frac{1}{\left(1+|X| \vee |Y|-(\Phi(X)|\Phi(Y))\right)^2}
$$
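
To make the distance in the formula concrete, here is a minimal sketch for the simplest case. It assumes lag 1 and one-hot positional features, so that $(\Phi(X)|\Phi(Y))$ counts the positions where the two sequences carry the same letter. The names `positional_matches`, `hamming_phi`, and `imq_h` are illustrative and are not part of `seq_tools`.

```python
def positional_matches(x: str, y: str) -> int:
    """Count aligned positions with equal letters (assumed form of (Phi(X)|Phi(Y)) for lag 1)."""
    return sum(a == b for a, b in zip(x, y))


def hamming_phi(x: str, y: str) -> int:
    """d_H^Phi(X, Y) = |X| v |Y| - (Phi(X)|Phi(Y)): longer length minus matching positions."""
    return max(len(x), len(y)) - positional_matches(x, y)


def imq_h(x: str, y: str) -> float:
    """IMQ-H kernel value 1 / (1 + d_H^Phi)^2, as written in the formula."""
    return 1.0 / (1.0 + hamming_phi(x, y)) ** 2


print(hamming_phi("ACGT", "ACGG"))  # equal lengths, one mismatch
print(hamming_phi("CDE", "ABCDE"))  # unequal lengths, no aligned matches
print(imq_h("ACGT", "ACGT"))        # identical sequences
```

Under this assumption, positions beyond the end of the shorter sequence always count as mismatches.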

Notes:
1) **Discrete masses imply universality and characteristicness:** By integrating the Hamming kernel over an inverse-multiquadric weight, the IMQ-H acquires the discrete-mass property. That means its RKHS contains delta functions at every sequence, which in turn guarantees the kernel is universal (it can approximate any sequence-to-phenotype map) and characteristic (its MMD will distinguish any two distributions).
2) **Heavy tails mitigate diagonal dominance:** The ordinary exponential Hamming kernel $e^{-\lambda d_H}$ decays too quickly as $d_H$ grows, causing "diagonal dominance" (almost all off-diagonal Gram entries vanish). IMQ-H decays like a power law, $(1+d_H)^{-2}$, retaining meaningful similarity scores even for more distant sequence pairs. This matters when the input file has many variants that differ by multiple residues.
3) **Exact positional comparison preserved:** IMQ-H uses the exact same $\Phi$-features (k-mer counts in fixed windows) as your original Hamming kernel. You still compare wild-type vs. mutant sequences position by position (so any biological rationale you have for using Hamming remains valid), but the functional form has changed to grant strong theoretical guarantees at essentially no extra computational cost.
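
The decay rates in note 2 can be compared numerically. The sketch below tabulates $e^{-\lambda d_H}$ against $(1+d_H)^{-2}$; the choice $\lambda = 1$ and the distance grid are assumptions made for illustration only.

```python
import numpy as np

lam = 1.0  # assumed rate for the exponential Hamming kernel
d = np.array([0, 1, 2, 5, 10, 20, 50])  # illustrative Hamming distances

exp_kernel = np.exp(-lam * d)   # exponential Hamming kernel
imq_kernel = (1.0 + d) ** -2.0  # IMQ-H kernel, power-law decay

for dist, e, q in zip(d, exp_kernel, imq_kernel):
    print(f"d_H={dist:3d}  exp={e:.3e}  imq={q:.3e}")
```

With this choice of $\lambda$, at distance 10 the exponential kernel is already below $10^{-4}$ while the power-law kernel remains near $10^{-2}$.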

### Implementation

In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import logging 

In [6]:
import numpy as np
from seq_tools import hamming_dist, get_ohe

def imq_hamming_kernel(seqs_x, seqs_y=None, alphabet_name='dna', scale=1, beta=0.5, lag=1):
    """
    Compute the inverse-multiquadric Hamming (IMQ-H) kernel between two sets of sequences.

    Parameters:
    seqs_x : numpy array or list of strings
        The first set of sequences.
    seqs_y : numpy array or list of strings, optional
        The second set of sequences. If None, computes the kernel between seqs_x and itself.
    alphabet_name : str, optional
        The alphabet type ('dna', 'rna', 'prot').
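    scale : float, optional
        Additive offset in the denominator, (scale + d_H).
    beta : float, optional
        Exponent applied to the kernel. With scale=1 and beta=2 the result
        matches 1 / (1 + d_H)^2 up to the constant factor (1 + scale) ** beta.
    lag : int, optional
        Length of k-mers used for Hamming distance calculation.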

    Returns:
    kernel_matrix : numpy array
        The computed kernel matrix.
    """
    if seqs_y is None:
        seqs_y = seqs_x

    # Compute Hamming distances
    h_dists = hamming_dist(seqs_x, seqs_y, alphabet_name=alphabet_name, lag=lag)

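    # Compute the IMQ-Hamming kernel. The (1 + scale) ** beta factor makes the
    # value exactly 1 at distance 1; at distance 0 it is ((1 + scale) / scale) ** beta.
    kernel_matrix = (1 + scale) ** beta / (scale + h_dists) ** beta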

    return kernel_matrix

In [3]:
# Example usage
if __name__ == "__main__":
    seqs_x = ["ACGT", "ACGA", "TCGT"]
    seqs_y = ["ACGG", "ACGC"]

    kernel_mat = imq_hamming_kernel(seqs_x, seqs_y, alphabet_name='dna', scale=1, beta=0.5, lag=1)
    print("IMQ-Hamming Kernel Matrix:")
    print(kernel_mat)

IMQ-Hamming Kernel Matrix:
[[1.         1.        ]
 [1.         1.        ]
 [0.81649658 0.81649658]]
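
The ones in the first two rows follow from the normalization in the code: with `scale=1` and `beta=0.5`, a pair at Hamming distance 1 gives exactly 1. A minimal sketch of the closed form, evaluated at a few distances chosen for illustration (the helper name `imq_value` is hypothetical):

```python
def imq_value(h_dist: float, scale: float = 1.0, beta: float = 0.5) -> float:
    """Same expression as in imq_hamming_kernel, for a single distance."""
    return (1 + scale) ** beta / (scale + h_dist) ** beta


for h in (0, 1, 2):
    print(h, round(imq_value(h), 8))
```

Distance 2 reproduces the 0.81649658 entries in the last row, and identical sequences (distance 0) would score about 1.414 rather than 1 under this normalization.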


In [3]:
df = pd.read_csv('mutation_with_sequences.csv')
print(df.head())

       Source   Gene                ENST Gene Code           ENST.1  \
0  cBioPortal  BRCA1   ENST00000357654.9    P38398  ENST00000357654   
1  cBioPortal  BRCA2   ENST00000380152.8    P51587  ENST00000380152   
2  cBioPortal   CDH1  ENST00000261769.10    P12830  ENST00000261769   
3  cBioPortal   CDH1  ENST00000261769.10    P12830  ENST00000261769   
4  cBioPortal   CDH1  ENST00000261769.10    P12830  ENST00000261769   

     Gene Name Mutation    Type  \
0  BRCA1_HUMAN   G1788V  Driver   
1  BRCA2_HUMAN   R2336C  Driver   
2  CADH1_HUMAN    D288N  Driver   
3  CADH1_HUMAN    D254Y  Driver   
4  CADH1_HUMAN    R732Q  Driver   

                                            wild_seq  \
0  MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKF...   
1  MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEP...   
2  MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHL...   
3  MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHL...   
4  MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHL...   

                         

In [4]:
original_seqs = df['wild_seq'].tolist()[:100]
mutated_seqs = df['mut_seq'].tolist()[:100]

In [7]:
# Redefine the kernel with 'prot' as the default alphabet for protein sequences.
def imq_hamming_kernel(seqs_x, seqs_y=None, alphabet_name='prot', scale=1, beta=0.5, lag=1):
    if seqs_y is None:
        seqs_y = seqs_x
    h_dists = hamming_dist(seqs_x, seqs_y, alphabet_name=alphabet_name, lag=lag)
    kernel_matrix = (1 + scale)**beta / (scale + h_dists)**beta
    return kernel_matrix

kernel_matrix = imq_hamming_kernel(original_seqs, mutated_seqs, alphabet_name='prot', scale=1, beta=0.5, lag=1)
print(kernel_matrix)

[[1.         0.02461084 0.03320446 ... 0.03296007 0.03296007 0.03296007]
 [0.02461084 1.         0.02441204 ... 0.02427499 0.02427142 0.02427499]
 [0.03320446 0.02441204 1.         ... 0.04828045 0.04830862 0.04825234]
 ...
 [0.03296007 0.02427499 0.04830862 ... 1.         1.         1.        ]
 [0.03296007 0.02427499 0.04830862 ... 1.         1.         1.        ]
 [0.03296007 0.02427499 0.04830862 ... 1.         1.         1.        ]]
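
To read these entries as distances, the kernel expression can be inverted. The sketch below assumes the same `scale=1` and `beta=0.5` used in the call above; `implied_distance` is a hypothetical helper, not part of `seq_tools`.

```python
def implied_distance(k: float, scale: float = 1.0, beta: float = 0.5) -> float:
    """Solve k = ((1 + scale) / (scale + d)) ** beta for the distance d."""
    return (1 + scale) * k ** (-1.0 / beta) - scale


# Values copied from the printed matrix above.
for k in (1.0, 0.04828045, 0.02461084):
    print(k, round(implied_distance(k)))
```

An entry of 1 corresponds to distance 1 (a single substitution), while the entry 0.02461084 corresponds to a distance of about 3301, so the small off-diagonal values reflect pairs that differ at hundreds to thousands of positions.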


In [13]:
# Example usage with sequences of unequal length
if __name__ == "__main__":
    seqs_x = ["BCDA", "CDE", "DCE"]
    seqs_y = ["ABCD", "ABCDE"]

    kernel_mat = imq_hamming_kernel(seqs_x, seqs_y, alphabet_name='prot', scale=1, beta=0.5, lag=1)
    print("IMQ-Hamming Kernel Matrix:")
    print(kernel_mat)

IMQ-Hamming Kernel Matrix:
[[0.63245553 0.57735027]
 [0.63245553 0.57735027]
 [0.63245553 0.57735027]]
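
Note 1 above refers to the MMD induced by the kernel. As a rough sketch of how such an estimate could be computed, the block below uses the biased (V-statistic) estimator with a self-contained stand-in for the kernel: it assumes lag 1 and the $1/(1+d_H)^2$ form from the formula, and does not call `seq_tools`. All function names and the toy samples are illustrative.

```python
import numpy as np


def hamming_phi(x: str, y: str) -> int:
    """Assumed lag-1 distance: longer length minus positions with equal letters."""
    return max(len(x), len(y)) - sum(a == b for a, b in zip(x, y))


def gram(xs, ys) -> np.ndarray:
    """Kernel matrix with entries 1 / (1 + d_H)^2."""
    return np.array([[1.0 / (1.0 + hamming_phi(x, y)) ** 2 for y in ys] for x in xs])


def mmd2_biased(xs, ys) -> float:
    """Biased MMD^2 estimate: mean(Kxx) + mean(Kyy) - 2 * mean(Kxy)."""
    return float(gram(xs, xs).mean() + gram(ys, ys).mean() - 2.0 * gram(xs, ys).mean())


sample_a = ["ACGT", "ACGA"]  # toy sample, for illustration only
sample_b = ["TTTT", "TTTA"]  # toy sample, for illustration only
print(mmd2_biased(sample_a, sample_a))  # identical samples
print(mmd2_biased(sample_a, sample_b))  # different samples
```

The estimate is zero when both samples are the same and positive for the two different toy samples here.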
