# Correlation analysis

This notebook demonstrates how to calculate the correlation between two RSP profiles. The Pearson correlation coefficient measures how strongly the two profiles are linearly related and ranges from -1 to 1. A value of 1 indicates a perfect positive linear correlation, a value of -1 indicates a perfect negative linear correlation, and a value of 0 indicates no linear correlation.

The notebook also calculates the Euclidean distance and Mean Squared Error (MSE) between the two RSP profiles. For both measures, a smaller value indicates more similar profiles and a larger value indicates more different profiles; a value of 0 means the two profiles are identical.
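
The difference between the correlation coefficient and the two distance measures can be seen on a toy example. The sketch below uses synthetic sine-based arrays as stand-ins for RSP profiles (they are not biorsp output, and the names `profile_a` and `profile_b` are illustrative only). Because `profile_b` is a scaled and offset copy of `profile_a`, the Pearson coefficient is 1 even though the Euclidean distance and MSE are clearly non-zero.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for two RSP profiles (assumption: not real data).
theta = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
profile_a = np.sin(theta)
profile_b = 2.0 * profile_a + 1.0  # same shape, different scale and offset

# Pearson correlation ignores scale and offset.
r, _ = pearsonr(profile_a, profile_b)

# Euclidean distance and MSE are sensitive to both.
euclidean = np.linalg.norm(profile_a - profile_b)
mse = np.mean((profile_a - profile_b) ** 2)

print(f"Pearson r: {r:.3f}")
print(f"Euclidean distance: {euclidean:.3f}")
print(f"MSE: {mse:.3f}")

# MSE is the squared Euclidean distance divided by the number of points,
# so the two measures always rank profile pairs of equal length identically.
print(np.isclose(mse, euclidean**2 / len(profile_a)))
```

In practice this means the correlation coefficient answers "do the profiles have the same shape?", while the distance measures answer "do the profiles have the same values?".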

In [15]:
# Import necessary libraries
import numpy as np
import pandas as pd

from ipynb.fs.full.biorsp import (
    find_foreground_background_points,
    calculate_rsp_area,
)

from scipy.stats import pearsonr

In [16]:
# Load your data
dge_matrix = pd.read_csv("data/MCA2_filtered.dge.txt", sep="\t", index_col=0)
tsne_results = pd.read_csv("embeddings/tsne_results.csv").to_numpy()
dbscan_results = pd.read_csv("embeddings/tsne_dbscan_results.csv")

In [17]:
gene_a = 'Actc1'
gene_b = 'Tnnt2'

In [18]:
threshold = 1  # Define the threshold for foreground points - default is 1
clusters = [1]  # Define the clusters to be considered as foreground - default is None (look at all clusters)
scanning_window = np.pi / 2  # Define the scanning window - default is pi/2
resolution = 1000  # Define the resolution - default is 1000
angle_range = np.array([0, 2 * np.pi])  # Define the angle range - default is [0, 2*pi]
mode = "absolute"  # Define the mode for CDFs - default is "absolute" (no trailing comma, which would create a tuple)

In [19]:
foreground_points_A, background_points_A = find_foreground_background_points(
    gene_name=gene_a,
    dge_matrix=dge_matrix,
    tsne_results=tsne_results,
    dbscan_df=dbscan_results,
    threshold=threshold,
    selected_clusters=clusters
)
vantage_point_A = np.mean(background_points_A, axis=0)
rsp_area_A, differences_A, rmsd_A = calculate_rsp_area(
    fg_points=foreground_points_A,
    bg_points=background_points_A,
    vantage_point=vantage_point_A,
    scanning_window=scanning_window,
    resolution=resolution,
    angle_range=angle_range,
    mode=mode
)

In [20]:
foreground_points_B, background_points_B = find_foreground_background_points(
    gene_name=gene_b,
    dge_matrix=dge_matrix,
    tsne_results=tsne_results,
    dbscan_df=dbscan_results,
    threshold=threshold,
    selected_clusters=clusters
)
vantage_point_B = np.mean(background_points_B, axis=0)
rsp_area_B, differences_B, rmsd_B = calculate_rsp_area(
    fg_points=foreground_points_B,
    bg_points=background_points_B,
    vantage_point=vantage_point_B,
    scanning_window=scanning_window,
    resolution=resolution,
    angle_range=angle_range,
    mode=mode
)

In [None]:
# Compute correlation coefficient and p-value
correlation_coefficient, p_value = pearsonr(differences_A, differences_B)
print(f"Correlation Coefficient: {correlation_coefficient}")
print(f"P-value: {p_value}")

# Compute Euclidean distance and MSE
euclidean_distance = np.linalg.norm(differences_A - differences_B)
mse = np.mean((differences_A - differences_B) ** 2)
print(f"Euclidean Distance: {euclidean_distance}")
print(f"Mean Squared Error: {mse}")

## Additional Metrics

We can also compare the two RSP profiles using additional metrics: Mutual Information (MI), Earth Mover's Distance (EMD), Spearman's Correlation Coefficient (SCC), Cross-correlation (CC), and the two-sample Kolmogorov-Smirnov (KS) test. Note that EMD and the KS test, as computed here, compare the distributions of the profile values and ignore the angle at which each value occurs.

In [None]:
from sklearn.metrics import mutual_info_score
from scipy.stats import ks_2samp, spearmanr, wasserstein_distance

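# 1. Mutual Information
# mutual_info_score expects discrete labels, so bin the continuous profiles first.
# The bin count (20) is an arbitrary choice for illustration, not a biorsp default.
n_bins = 20
labels_A = np.digitize(differences_A, np.histogram_bin_edges(differences_A, bins=n_bins))
labels_B = np.digitize(differences_B, np.histogram_bin_edges(differences_B, bins=n_bins))
mi = mutual_info_score(labels_A, labels_B)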

# 2. Earth Mover's Distance
emd = wasserstein_distance(differences_A, differences_B)

# 3. Spearman's Correlation
spearman_coefficient, spearman_p = spearmanr(differences_A, differences_B)

# 4. Cross-correlation
cross_corr = np.correlate(differences_A - np.mean(differences_A), differences_B - np.mean(differences_B), mode='full')
lags = np.arange(-len(differences_A)+1, len(differences_A))
max_corr = np.max(cross_corr)
lag_at_max_corr = lags[np.argmax(cross_corr)]

# 5. KS Test
ks_statistic, ks_p_value = ks_2samp(differences_A, differences_B)

print(f"Mutual Information: {mi}")
print(f"Earth Mover's Distance: {emd}")
print(f"Spearman Correlation Coefficient: {spearman_coefficient}, P-value: {spearman_p}")
print(f"Maximum Cross-Correlation: {max_corr}, Lag: {lag_at_max_corr}")
print(f"KS Statistic: {ks_statistic}, P-value: {ks_p_value}")
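
The cross-correlation above treats the profiles as linear sequences, so shifted samples fall off the ends. If the profiles cover the full circle (as with the `angle_range` of `[0, 2 * np.pi]` used here), a circular cross-correlation may be a more natural fit. The following is a minimal sketch under two assumptions: the profile is sampled uniformly over the full circle, and the endpoint is not duplicated. The function `circular_cross_correlation` is a hypothetical helper, not part of biorsp, and the input is a synthetic profile rather than real data.

```python
import numpy as np


def circular_cross_correlation(a, b):
    """Normalized circular cross-correlation of two equal-length 1-D profiles.

    Entry k is the Pearson correlation between a and np.roll(b, k).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    # Correlation theorem: ifft(fft(a) * conj(fft(b)))[k] = sum_n a[n + k] * b[n]
    corr = np.fft.ifft(np.fft.fft(a) * np.conj(np.fft.fft(b))).real
    return corr / len(a)


# Synthetic example (assumption): profile_a is profile_b rotated by 250 samples.
n_points = 1000
theta = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
profile_b = np.sin(theta)
profile_a = np.roll(profile_b, 250)

corr = circular_cross_correlation(profile_a, profile_b)
best_lag = int(np.argmax(corr))

# Convert the lag from samples to radians, assuming uniform sampling of the circle.
lag_angle = best_lag * 2 * np.pi / n_points

print(f"Best lag: {best_lag} samples ({lag_angle:.3f} rad)")
print(f"Correlation at best lag: {corr[best_lag]:.3f}")
```

Because the result is normalized, the value at the best lag can be read on the same -1 to 1 scale as the Pearson coefficient, and the lag itself gives the angular offset at which the two profiles align best.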