This notebook explores different ways we can save the large similarity matrix in a more compressed format.

The similarity matrix is dense (all values are filled in) and symmetric.

In [237]:
import datetime
import os
import s3fs
import sys
import json

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import seaborn as sns

from datascience.menupackage import data_loader
from datascience.menupackage import data_retrieval
from datascience.connections import redshift
from datascience import queries

In [297]:
bucket = 's3-gousto-artichokes-airflow'
s3_key = 'static_inputs/similarity_results.npz'
s3_connection = s3fs.S3FileSystem(anon=False)

In [246]:
%%time
similarity_matrix = data_loader.Recipe_Item_Loader.retrieve_similarity_matrix(
    s3_path=os.path.join(bucket, s3_key),
    fs=s3_connection
)

CPU times: user 599 ms, sys: 566 ms, total: 1.16 s
Wall time: 13.1 s


## First method: original method using np.savez

In [73]:
def write_npz(dataframe: pd.DataFrame) -> None:
    np.savez('original_method.npz', arr_0=dataframe.values, arr_1=dataframe.columns.values)

In [74]:
%%time
write_npz(similarity_matrix)

CPU times: user 23.6 ms, sys: 42.2 ms, total: 65.8 ms
Wall time: 66.9 ms


In [75]:
print('Original file size is {} MB'.format(os.path.getsize('original_method.npz')/10**6))
original_size = os.path.getsize('original_method.npz')/10**6
#os.remove('original_method.npz')

Original file size is 88.951626 MB


In [76]:
%%time
sim = np.load('original_method.npz')
similarity_matrix2 = pd.DataFrame(sim['arr_0'], columns=sim['arr_1'])
similarity_matrix2.index = sim['arr_1']

CPU times: user 26.2 ms, sys: 24.9 ms, total: 51.1 ms
Wall time: 49.8 ms


## Second method: simply use np.savez_compressed

In [77]:
# Same arrays as write_npz, but written to a zip-compressed (DEFLATE) archive.
def write_npz_compressed(dataframe: pd.DataFrame) -> None:
    np.savez_compressed('matrix.npz', arr_0=dataframe.values, arr_1=dataframe.columns.values)

In [80]:
%%time 
write_npz_compressed(similarity_matrix)

CPU times: user 2.74 s, sys: 30.2 ms, total: 2.77 s
Wall time: 2.78 s


In [82]:
%%time
sim = np.load('matrix.npz')
similarity_matrix2 = pd.DataFrame(sim['arr_0'], columns=sim['arr_1'])
similarity_matrix2.index = sim['arr_1']

CPU times: user 200 ms, sys: 15.3 ms, total: 215 ms
Wall time: 217 ms


In [215]:
print('Compressed file size is {} MB'.format(os.path.getsize('matrix.npz')/10**6))
compressed_size = os.path.getsize('matrix.npz')/10**6
print("This is {:.2f}% of original size".format(os.path.getsize('matrix.npz')/os.path.getsize('original_method.npz')*100))

Compressed file size is 14.138965 MB
This is 15.90% of original size


In [65]:
print("Size is smaller by {}".format(compressed_size/original_size))

Size is smaller by 0.15895116970655487


The file is only 16% of its original size, but saving takes longer (wall time goes from 66.9 ms to 2.78 s). In practice, the limiting factor when writing to S3 is probably the transfer of the data over the network rather than the CPU, and loading the compressed file still takes only about 200 ms.
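
To illustrate that only the write call changes between the first two methods, here is a minimal sketch on a small synthetic matrix. The data, labels, and file names are stand-ins chosen for illustration, not the real similarity matrix, and the compression ratio it prints says nothing about the ratio on real data.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in data: rounded values compress well, real similarities may not.
rng = np.random.default_rng(0)
values = np.round(rng.random((200, 200)), 2)
labels = np.arange(200)

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, 'plain.npz')
    compressed_path = os.path.join(tmp_dir, 'compressed.npz')

    # Only the function used for writing differs between the two files.
    np.savez(plain_path, arr_0=values, arr_1=labels)
    np.savez_compressed(compressed_path, arr_0=values, arr_1=labels)

    plain_size = os.path.getsize(plain_path)
    compressed_size = os.path.getsize(compressed_path)

    # The reading side is identical for both files.
    with np.load(plain_path) as plain, np.load(compressed_path) as compressed:
        same_values = np.array_equal(plain['arr_0'], compressed['arr_0'])
        same_labels = np.array_equal(plain['arr_1'], compressed['arr_1'])

print(f'plain: {plain_size} bytes, compressed: {compressed_size} bytes')
print(f'identical after loading: {same_values and same_labels}')
```

Because `np.load` handles both archive types transparently, any code that reads `arr_0` and `arr_1` should keep working unchanged.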

## Third method: make use of symmetry

Because the matrix is symmetric, we could save only its upper triangle.
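
As a rough illustration of what gets stored, the sketch below extracts the strict upper triangle of a small symmetric matrix with `np.triu_indices`. The 5 x 5 matrix is made-up data, and the unit diagonal is an assumption about how similarities behave, not something taken from the real matrix.

```python
import numpy as np

# Small symmetric stand-in for the similarity matrix (illustrative data only).
rng = np.random.default_rng(0)
n = 5
raw = rng.random((n, n))
matrix = (raw + raw.T) / 2
np.fill_diagonal(matrix, 1.0)  # assumed self-similarity of 1

# k=1 selects entries strictly above the diagonal, so the diagonal is not stored.
rows, cols = np.triu_indices(n, k=1)
upper = matrix[rows, cols]

print(f'full matrix: {matrix.size} values, upper triangle: {upper.size} values')
```

With an offset of 1, an N x N matrix is reduced to N(N-1)/2 values. For the 3334 x 3334 matrix in this notebook that is 5,556,111 values instead of 11,115,556. The diagonal is dropped, so it has to be restored separately when the matrix is rebuilt.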

In [None]:
data = similarity_matrix.values

In [221]:
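def write_npz_compressed_upper_half(dataframe: pd.DataFrame) -> None:
    # Use the size of the argument rather than the global `data`.
    length = len(dataframe)
    # Offset 1 keeps only the values strictly above the diagonal; the diagonal is not saved.
    np.savez_compressed('third_method.npz', arr_0=dataframe.values[np.triu_indices(length, 1)], arr_1=dataframe.columns.values)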

In [222]:
%%time
write_npz_compressed_upper_half(similarity_matrix)

CPU times: user 1.47 s, sys: 69.6 ms, total: 1.54 s
Wall time: 1.54 s


In [214]:
print('file size is {} MB'.format(os.path.getsize('third_method.npz')/10**6))
print("This is {:.2f}% of original size".format(os.path.getsize('third_method.npz')/os.path.getsize('original_method.npz')*100))

file size is 7.260964 MB
This is 8.16% of original size


In this case, we have to add some logic to rebuild the full matrix when loading it.

In [217]:
%%time
sim = np.load('third_method.npz')
saved_values = sim['arr_0']
cols = sim['arr_1']

# Start from zeros; the diagonal was not saved, so it stays at zero here.
sim_matrix = np.zeros((len(cols), len(cols)))
for i, (j, k) in enumerate(zip(np.triu_indices(len(cols), 1)[0], np.triu_indices(len(cols), 1)[1])):
    # Mirror each saved value across the diagonal.
    sim_matrix[j][k] = saved_values[i]
    sim_matrix[k][j] = saved_values[i]
reconstructed_sim_matrix = pd.DataFrame(sim_matrix, columns=cols, index=cols)

CPU times: user 7.53 s, sys: 316 ms, total: 7.84 s
Wall time: 8.11 s


Unpacking the values takes some time: the loop visits each of the N(N-1)/2 saved values, so it has O(N^2) time complexity, and every iteration runs in the Python interpreter.
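
One possible way to avoid the Python-level loop is to assign all values at once with NumPy index arrays. This is a sketch, not the code that produced the timings above: `rebuild_symmetric` and its `diagonal` argument are hypothetical names, and filling the diagonal with 1.0 assumes every item has a self-similarity of 1, which this notebook does not confirm.

```python
import numpy as np


def rebuild_symmetric(upper_values: np.ndarray, n: int, diagonal: float = 1.0) -> np.ndarray:
    """Rebuild an n x n symmetric matrix from its strict upper triangle."""
    rows, cols = np.triu_indices(n, k=1)
    full = np.zeros((n, n), dtype=upper_values.dtype)
    full[rows, cols] = upper_values  # upper triangle
    full[cols, rows] = upper_values  # mirrored lower triangle
    np.fill_diagonal(full, diagonal)  # assumption: the diagonal is a known constant
    return full


# Round trip on made-up data to check the reconstruction.
rng = np.random.default_rng(0)
n = 6
raw = rng.random((n, n))
original = (raw + raw.T) / 2
np.fill_diagonal(original, 1.0)

upper = original[np.triu_indices(n, k=1)]
rebuilt = rebuild_symmetric(upper, n)
print(np.array_equal(original, rebuilt))
```

The work is still O(N^2), since every entry has to be written, but the assignments run inside NumPy instead of the interpreter, which should be considerably faster than the loop for a matrix of this size.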

If this file were saved in S3, how long would loading it take?

1. original method

In [290]:
%%time
sims = data_retrieval.Recipe_Item_Retrieval.retrieve_similarities(
    s3_path=os.path.join('s3-gousto-artichokes-airflow', 'static_inputs/similarity_results.npz'),
    fs=s3_connection
)
similarity_matrix = pd.DataFrame(sims['arr_0'],columns = sims['arr_1'])
similarity_matrix.index = sims['arr_1']

CPU times: user 696 ms, sys: 637 ms, total: 1.33 s
Wall time: 15.8 s


2. new method

In [291]:
%%time
sims = data_retrieval.Recipe_Item_Retrieval.retrieve_similarities(
    s3_path=os.path.join('s3-gousto-artichokes-airflow', 'static_inputs/third_method.npz'),
    fs=s3_connection
)
saved_values = sims['arr_0']
cols = sims['arr_1']

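# Start from zeros; the diagonal was not saved, so it stays at zero here.
sim_matrix = np.zeros((len(cols), len(cols)))
for i, (j, k) in enumerate(zip(np.triu_indices(len(cols), 1)[0], np.triu_indices(len(cols), 1)[1])):
    # Mirror each saved value across the diagonal.
    sim_matrix[j][k] = saved_values[i]
    sim_matrix[k][j] = saved_values[i]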

CPU times: user 6.95 s, sys: 170 ms, total: 7.12 s
Wall time: 7.89 s


## Last try: use HDF5?

In [225]:
import numpy as np
import h5py
a = similarity_matrix.values
h5f = h5py.File('data.h5', 'w')
h5f.create_dataset('dataset_1', data=a)

<HDF5 dataset "dataset_1": shape (3334, 3334), type "<f8">

In [226]:
print('file size is {} MB'.format(os.path.getsize('data.h5')/10**6))

file size is 88.926496 MB


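Never mind, this doesn't reduce anything: the dataset is written without a compression filter, so the file (88.9 MB) is essentially the same size as the uncompressed `.npz`.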

## Conclusions

- with no changes, loading the data from S3 and then unpacking the numpy array into a dataframe took a wall time of 13.7 s on my local machine and a CPU time of 616 ms. The gap between the two suggests that most of the wall time is spent waiting on the S3 transfer: loading the same file from local disk and converting it to a dataframe took only about 50 ms.
- size-wise, it is best to use `np.savez_compressed`, which reduces the file size by a factor of about 6.3 (88.95 MB to 14.14 MB). This has the advantage that the loading code does not change at all, so none of the code that ingests the similarity matrix has to change. Saving takes longer (CPU time goes from about 24 ms to 2.7 s), but in practice the overhead of uploading would probably come from transferring the data to S3 rather than from the compression.
- to halve that again, we can save just the upper triangle, which makes the file about 12.3 times smaller than the original. Loading it from S3 and unpacking it takes about half the wall time of the original method (15.8 s down to 7.89 s), while CPU time increases a lot (1.33 s to 7.12 s). A likely explanation is that the smaller download spends less time waiting on the network, while the Python loop that rebuilds the full matrix adds CPU work.
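
The size ratios quoted above can be rechecked from the file sizes printed earlier in this notebook. The snippet below is only that arithmetic; the variable names are illustrative.

```python
# File sizes in MB, copied from the outputs printed earlier in this notebook.
original_mb = 88.951626
compressed_mb = 14.138965
upper_half_mb = 7.260964

compressed_ratio = original_mb / compressed_mb
upper_half_ratio = original_mb / upper_half_mb

print(f'np.savez_compressed: {compressed_ratio:.2f}x smaller')
print(f'upper triangle only: {upper_half_ratio:.2f}x smaller')
```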

In [None]:
# Clean up files
files = ['matrix.npz', 'data.h5', 'original_method.npz', 'third_method.npz']
for file in files:
    os.remove(file)