In this notebook I compare how fast scipy and sklearn compute the distance matrix.

I start by adding the folder that contains the project classes to the import path.

In [1]:
import sys
sys.path.insert(1, '/home/elinfi/MasterCode/src/class/')

In [2]:
import random
import kmedoids

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

from time import time
from k_medoids import KMedoids
from mpl_toolkits.mplot3d import Axes3D
from dissimilarity_matrix import DissimilarityMatrix
from data_preparation import DataPreparation
# aliased so that it does not shadow KMedoids imported from k_medoids above
from sklearn_extra.cluster import KMedoids as SklearnKMedoids



Set the paths to the merged wild-type and cancer Hi-C data. The resolution is set to 32000 to reduce run time, and chr4:10M-15M is used as an example region.

In [3]:
# paths to the multi-resolution Hi-C data
path_wt = '/home/elinfi/coolers/HiC_wt_merged.mcool'
path_cancer = '/home/elinfi/coolers/HiC_cancer_merged.mcool'

# resolution
resolution = 32000

# region of genome
region = 'chr4:10M-15M'

The Hi-C data is then loaded as a NumPy array covering the given region at the given resolution. By default the matrices are balanced using the ICE method built into the cooler package.

In [4]:
# create a DataPreparation object for each data set
wt = DataPreparation(path_wt, resolution, region)
cancer = DataPreparation(path_cancer, resolution, region)

print("The shape of the Hi-C contact matrix: ")
print(wt.matrix.shape)

The shape of the Hi-C contact matrix: 
(157, 157)
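
The shape can be checked by hand. As a rough sketch, assuming the bins are aligned to the start of the chromosome and the fetched region includes every bin that overlaps it (the helper `n_bins` is hypothetical and not part of the notebook's classes):

```python
def n_bins(start, end, resolution):
    """Number of fixed-size bins that overlap the half-open interval [start, end)."""
    first_bin = start // resolution
    last_bin = (end - 1) // resolution
    return last_bin - first_bin + 1

# chr4:10M-15M at a resolution of 32000 bp
print(n_bins(10_000_000, 15_000_000, 32_000))
```

Under these assumptions the result is 157, which matches the printed shape.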


In [5]:
# calculate the relative difference between the two matrices
diff = wt.relative_difference(cancer)

In [6]:
# calculate the distance matrix
dissimilarity = DissimilarityMatrix(diff)

# using scipy.spatial.distance.pdist
start = time()
distmat_scipy = dissimilarity.scipy_dist('chebyshev', 0, 3)
end = time()
print(f"Scipy: {end - start}s")

# using sklearn.metrics.pairwise_distances
start = time()
distmat_sklearn = dissimilarity.sklearn_dist('chebyshev')
end = time()
print(f"Sklearn: {end - start}s")

Scipy: 0.5094664096832275s
Sklearn: 0.21376562118530273s
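
For reference, the two library functions can also be timed directly. This is a minimal sketch on random data, not the notebook's `DissimilarityMatrix` implementation; the array `X`, its size, and the seed are made up for illustration. Note that `pdist` returns a condensed vector, which `squareform` expands to the full square matrix that `pairwise_distances` returns.

```python
from time import perf_counter

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.random((500, 3))  # hypothetical data: 500 points with 3 features

# scipy: condensed distances, expanded to a square matrix
start = perf_counter()
d_scipy = squareform(pdist(X, metric='chebyshev'))
t_scipy = perf_counter() - start

# sklearn: square matrix directly
start = perf_counter()
d_sklearn = pairwise_distances(X, metric='chebyshev')
t_sklearn = perf_counter() - start

print(f"Scipy: {t_scipy:.4f}s, Sklearn: {t_sklearn:.4f}s")
print(np.allclose(d_scipy, d_sklearn))
```

Timings from such a small example will vary between machines and runs, so they should not be read as a benchmark.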


In [7]:
# calculate the distance matrix with the custom metric 'interactions_dist'
dissimilarity = DissimilarityMatrix(diff)

# using scipy.spatial.distance.pdist
start = time()
distmat_scipy = dissimilarity.scipy_dist(metric='interactions_dist', col1=0, col2=3)
end = time()
print(f"Scipy: {end - start}s")

Scipy: 58.22506260871887s


In [9]:
# using sklearn.metrics.pairwise_distances
start = time()
distmat_sklearn = dissimilarity.sklearn_dist(metric='interactions_dist')
end = time()
print(f"Sklearn: {end - start}s")

Sklearn: 118.90540623664856s


Sklearn is faster with a built-in distance metric (0.21 s against 0.51 s for chebyshev), while scipy is faster with a custom distance metric (58 s against 119 s for interactions_dist). The difference is mainly visible when the contact matrix is large.
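
Both libraries accept a Python callable as the metric, which is presumably how a custom metric such as `interactions_dist` is passed along. A minimal sketch, assuming a stand-in metric (`mean_abs_diff` is hypothetical and is not the notebook's `interactions_dist`):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances

def mean_abs_diff(u, v):
    """Stand-in custom metric: mean absolute difference between two 1-D vectors."""
    return np.mean(np.abs(u - v))

rng = np.random.default_rng(0)
X = rng.random((50, 3))  # hypothetical data: 50 points with 3 features

d_scipy = squareform(pdist(X, metric=mean_abs_diff))
d_sklearn = pairwise_distances(X, metric=mean_abs_diff)
print(np.allclose(d_scipy, d_sklearn))
```

With a callable, both libraries invoke the Python function once per pair of rows, so the run time grows roughly quadratically with the number of rows.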

In [None]:
# count the entries where the two distance matrices differ (0 means they are equal)
print(np.sum(distmat_scipy != distmat_sklearn))
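
An exact `!=` comparison counts entries that differ only by floating-point rounding, and it also counts every NaN, since NaN is never equal to itself. A tolerance-based check is one alternative; the sketch below uses small made-up arrays rather than the notebook's matrices.

```python
import numpy as np

# made-up matrices that agree up to rounding and share a NaN entry
a = np.array([[0.0, 0.3], [0.3, np.nan]])
b = np.array([[0.0, 0.1 + 0.2], [0.1 + 0.2, np.nan]])

# exact comparison: rounding differences and NaNs are counted as mismatches
n_exact_mismatch = np.sum(a != b)

# tolerance-based comparison that treats NaN as equal to NaN
n_close_mismatch = np.sum(~np.isclose(a, b, equal_nan=True))

print(n_exact_mismatch, n_close_mismatch)
```

Here the exact comparison reports 3 mismatches and the tolerance-based one reports 0.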