### Imports

In [1]:
from spiq.streamer.data_streamer import DataStreamer
from spiq.utils.fingerprints import FingerprintCalculator

### Fingerprint Module

A simple example of how the API works:

In [2]:
# Define a list of SMILES strings
smiles_list = ["CCO", "C1CCCCC1", "O=C=O", "O=C=O"]

# Define fingerprint parameters
params = {'fpSize': 2048, 'radius': 2}

# Create an instance of FingerprintCalculator
calculator = FingerprintCalculator()

# Compute fingerprints for the list of SMILES strings
fingerprints = calculator.FingerprintFromSmiles(smiles_list, 'morgan', **params)

# Display the shape of the output fingerprint array
print(f"Fingerprints shape: {fingerprints.shape}")
print("Fingerprint 1", fingerprints[1])

Fingerprints shape: (4, 2048)
Fingerprint 1 [0 0 1 ... 0 0 0]


How to use the API to read SMILES in chunks and compute fingerprints chunk by chunk:

In [3]:
# Create the streamer that reads the input file in chunks
ds = DataStreamer()

chunksize = 1_230
smiles = ds.parse_input(input_path='../data/data_lite.txt', chunksize=chunksize)
# parse_input returns a generator; iterate over it to get each chunk of data
print(type(smiles))

count = 0
for smiles_chunk in smiles:
    count += len(smiles_chunk)
    calculator.FingerprintFromSmiles(smiles_chunk, 'morgan', **params)
    print(f"\r Fingerprints calculated: {count:,}", end='', flush=True)

<class 'generator'>
 Fingerprints calculated: 10,000
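
The chunked reading pattern above can be sketched in plain Python. This is only an illustration of the generator idea, not the actual `DataStreamer.parse_input` implementation: the helper `iter_chunks` and the toy input are hypothetical, and we assume one SMILES string per line.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunksize: int) -> Iterator[List[str]]:
    """Yield lists of at most `chunksize` non-empty, stripped lines."""
    cleaned = (line.strip() for line in lines if line.strip())
    while True:
        chunk = list(islice(cleaned, chunksize))
        if not chunk:
            return
        yield chunk


# Toy input standing in for a SMILES file with one molecule per line
toy_lines = ["CCO\n", "C1CCCCC1\n", "\n", "O=C=O\n", "CCN\n", "CCC\n"]
chunk_sizes = [len(chunk) for chunk in iter_chunks(toy_lines, chunksize=2)]
print(chunk_sizes)  # [2, 2, 1]
```

Because only one chunk is held in memory at a time, the same loop works for files far larger than the available RAM.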

To save each chunk as a separate file, which is useful for large datasets that will be reused later, use `save_chunk` from `helper_functions.py`:

In [4]:
from spiq.utils.helper_functions import save_chunk

# Recreate the generator, since a generator is exhausted once it has been iterated over
smiles = ds.parse_input(input_path='../data/data_lite.txt', chunksize=chunksize)

count = 0
for idx, smiles_chunk in enumerate(smiles):
    count += len(smiles_chunk)
    fp_chunk = calculator.FingerprintFromSmiles(smiles_chunk, 'morgan', **params)
    save_chunk(fp_chunk, output_dir='../data/', chunk_index=idx, file_format='npy')
    print(f"\r Fingerprints calculated: {count:,}", end='', flush=True)

 Fingerprints calculated: 10,000
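
To show how chunks saved this way can be read back later, here is a minimal sketch using only NumPy. The helper `save_chunk_npy` and its file naming scheme are hypothetical stand-ins, not the behavior of `save_chunk`; the fingerprints are random toy data.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_chunk_npy(chunk: np.ndarray, output_dir: Path, chunk_index: int) -> Path:
    """Hypothetical stand-in for save_chunk: write one .npy file per chunk."""
    path = output_dir / f"chunk_{chunk_index:05d}.npy"
    np.save(path, chunk)
    return path


rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Three toy "fingerprint" chunks of 5 rows x 16 bits each
    chunks = [rng.integers(0, 2, size=(5, 16), dtype=np.uint8) for _ in range(3)]
    paths = [save_chunk_npy(c, out_dir, i) for i, c in enumerate(chunks)]
    # Reload the files in index order and stack them back into one array
    restored = np.vstack([np.load(p) for p in sorted(paths)])

restored_shape = restored.shape
round_trip_ok = bool(np.array_equal(restored, np.vstack(chunks)))
print(restored_shape, round_trip_ok)  # (15, 16) True
```

Zero-padding the chunk index keeps the files in the right order when they are sorted by name.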

### Product Quantization

In [9]:
# First we load the SMILES and calculate our fingerprints
with open('../data/data_lite.txt', 'r') as file:
    smiles = file.read().split('\n')
fingerprints = calculator.FingerprintFromSmiles(smiles, 'morgan', fpSize=1024, radius=3)

print(fingerprints.shape, fingerprints.nbytes)

(10001, 1024) 10241024


Then we fit the PQ encoder on our training data.  
`k` is the number of centroids used when running KMeans on each subvector.  
`m` is the number of subvectors (splits) each input vector is divided into.  
`iterations` is the maximum number of iterations for each KMeans run.  
Higher values of `k` and `iterations` lead to longer training times.  

In [10]:
from spiq.encoder.encoder import PQEncoder

pq_encoder = PQEncoder(k=256, m=4, iterations=10)
pq_encoder.fit(fingerprints)

Training PQEncoder: 100%|██████████| 4/4 [00:01<00:00,  3.09it/s]
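
To make the roles of `k`, `m` and `iterations` concrete, here is a minimal sketch of PQ training on toy data. It assumes a bare-bones Lloyd's k-means written in NumPy rather than the sklearn `KMeans` used by `PQEncoder`; the functions `kmeans` and `fit_pq` are hypothetical and only illustrate the idea of one clustering per subvector.

```python
import numpy as np


def kmeans(X, k, iterations, rng):
    """Very small Lloyd's k-means; returns the centroids only."""
    # Initialize centroids from k distinct random samples
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iterations):
        # Squared Euclidean distance from every sample to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def fit_pq(X, k, m, iterations, seed=0):
    """Split the columns into m subvectors and cluster each one separately."""
    rng = np.random.default_rng(seed)
    n_samples, dim = X.shape
    if dim % m != 0:
        raise ValueError("The input dimension must be divisible by m")
    sub_dim = dim // m
    codebook = np.empty((m, k, sub_dim))
    for i in range(m):
        sub = X[:, i * sub_dim:(i + 1) * sub_dim]
        codebook[i] = kmeans(sub, k, iterations, rng)
    return codebook


rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(200, 32), dtype=np.uint8)  # toy binary fingerprints
codebook = fit_pq(X_toy, k=8, m=4, iterations=5)
print(codebook.shape)  # (4, 8, 8) = (m, k, dim / m)
```

The cost of each iteration grows with both `k` and the number of samples, which is why larger `k` and `iterations` mean longer training.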


We can check some attributes:  
`.codebook_cluster_centers` holds the centroid coordinates from the KMeans run on each subvector. Since we have 4 splits, 256 centroids, and subvectors of size 1024/4 = 256, the codebook has shape (4, 256, 256).  
Once `pq_encoder` is fitted, the attribute `encoder_is_trained` records that training took place. Calling `transform` without fitting first raises an error, so here we check that the encoder was in fact trained.  
To access the `KMeans` attributes that sklearn normally exposes, use the attribute `pq_trained`. For example, `.labels_` gives the index of the centroid assigned to each training sample.  

In [11]:
print("The shape of the codebook is: ", pq_encoder.codebook_cluster_centers.shape)
print("Is the encoder trained? ", pq_encoder.encoder_is_trained)
print(f"The labels: {pq_encoder.pq_trained.labels_} are of length: {len(pq_encoder.pq_trained.labels_)}")

The shape of the codebook is:  (4, 256, 256)
Is the encoder trained?  True
The labels: [212 212 212 ... 153  60  53] are of length: 10001
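
The fit-before-transform check mentioned above follows a common pattern, sketched below. `TinyEncoder` is a hypothetical toy class, not the spiq `PQEncoder`; in particular, the exception type and the placeholder "training" are assumptions made only for illustration.

```python
import numpy as np


class TinyEncoder:
    """Hypothetical encoder illustrating a fitted-state guard."""

    def __init__(self):
        self.encoder_is_trained = False
        self.centers = None

    def fit(self, X):
        # Placeholder "training": remember the first two rows as centers
        self.centers = np.asarray(X[:2], dtype=float)
        self.encoder_is_trained = True
        return self

    def transform(self, X):
        if not self.encoder_is_trained:
            raise RuntimeError("Call fit() before transform().")
        X = np.asarray(X, dtype=float)
        # Index of the nearest center for every row
        dists = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1).astype(np.uint8)


X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.0], [0.1, 0.0]])
enc = TinyEncoder()
try:
    enc.transform(X_toy)
    guard_triggered = False
except RuntimeError:
    guard_triggered = True

codes = enc.fit(X_toy).transform(X_toy)
print(guard_triggered, codes)  # True [0 1 1 0]
```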


After training, we can create our PQ-codes.
The PQ-codes have shape `(number of samples, m)`. 

In [15]:
import numpy as np 
import time

# First we load the SMILES and calculate their fingerprints
with open('../data/training_data.txt', 'r') as file:
    smiles = file.read().split('\n')
X_test = calculator.FingerprintFromSmiles(smiles, 'morgan', fpSize=1024, radius=3)

print(f"{X_test.shape[0]:,} fingerprints of {X_test.shape[1]} dimensions to be transformed into PQ-codes")

s = time.time()
X_pq_code = pq_encoder.transform(X_test)
e = time.time()
print(f"Transforming {X_test.shape[0]:,} fingerprints took {(e-s):.2f} seconds")

1,000,001 fingerprints of 1024 dimensions to be transformed into PQ-codes


Generating PQ-codes: 100%|██████████| 4/4 [00:04<00:00,  1.19s/it]

Transforming 1,000,001 fingerprints took 4.77 seconds
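
Conceptually, the transform step replaces each subvector with the index of its nearest centroid. The sketch below illustrates this on toy data with a random, untrained codebook; `encode_pq` is a hypothetical helper and not the `PQEncoder.transform` implementation.

```python
import numpy as np


def encode_pq(X, codebook):
    """Map each row of X to m centroid indices, one per subvector."""
    m, k, sub_dim = codebook.shape
    # uint8 is enough as long as k <= 256 (indices 0..255)
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for i in range(m):
        sub = X[:, i * sub_dim:(i + 1) * sub_dim].astype(float)
        # Squared distance from each subvector to each centroid of split i
        dists = ((sub[:, None, :] - codebook[i][None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = dists.argmin(axis=1)
    return codes


rng = np.random.default_rng(0)
toy_codebook = rng.random((4, 16, 8))  # m=4, k=16, sub_dim=8 (random, untrained)
X_toy = rng.integers(0, 2, size=(100, 32), dtype=np.uint8)
toy_codes = encode_pq(X_toy, toy_codebook)
print(toy_codes.shape, toy_codes.dtype)  # (100, 4) uint8
```

Each of the `m` columns is an index into the corresponding block of the codebook, which is why the output has one column per subvector.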






Since each PQ-code has 4 entries and each entry can take an integer value from 0 to 255, the theoretical number of unique PQ-codes is $256^4 = 4,294,967,296$. In practice, the number of distinct codes we observe is much smaller, as we can check below.  


In [23]:
X_pq_code

array([[ 53,  53,  53, 149],
       [ 53,  53,  19, 125],
       [ 53,  53,  53,  21],
       ...,
       [242,  53, 195, 212],
       [ 53,  86, 243, 172],
       [ 53,  53,  53,  53]], shape=(1000001, 4), dtype=uint8)

In [17]:
import numpy as np
# Count unique rows
unique_rows = np.unique(X_pq_code, axis=0)
num_unique_vectors = unique_rows.shape[0]
print("Number of unique 4-dim vectors:", num_unique_vectors)

Number of unique 4-dim vectors: 89514
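
To put this count in perspective, the short calculation below compares it with the theoretical maximum. It only uses numbers printed above (1,000,001 samples, 89,514 unique codes, `k=256`, `m=4`); the variable names are ours.

```python
k, m = 256, 4
n_samples = 1_000_001
n_unique = 89_514

theoretical = k ** m                         # all possible PQ-codes
upper_bound = min(theoretical, n_samples)    # cannot exceed the number of samples
fraction_used = n_unique / theoretical       # share of the code space observed
samples_per_code = n_samples / n_unique      # average samples sharing one code

print(f"Theoretical codes: {theoretical:,}")
print(f"Upper bound for this dataset: {upper_bound:,}")
print(f"Fraction of the code space used: {fraction_used:.6%}")
print(f"Average samples per unique code: {samples_per_code:.1f}")
```

Even in the best case the dataset could occupy only about one million of the roughly 4.3 billion possible codes, and the observed count is lower still because many molecules collapse onto the same code.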


The main advantage of transforming the binary fingerprints into PQ-codes is that we store (almost) the same information far more compactly. Here each fingerprint takes 1024 bytes (one byte per bit) and each PQ-code takes 4 bytes, so storing the same data requires 256 times less memory.


In [22]:
print(f"Original input of shape: {X_test.shape} and size of {X_test.nbytes:,} bytes is now transformed into shape {X_pq_code.shape} and size of {X_pq_code.nbytes:,} bytes")
print(f"This is {int(X_test.nbytes / X_pq_code.nbytes)} times more memory efficient")

Original input of shape: (1000001, 1024) and size of 1,024,001,024 bytes is now transformed into shape (1000001, 4) and size of 4,000,004 bytes
This is 256 times more memory efficient
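
The 256x figure is relative to fingerprints stored with one byte per bit. As a point of comparison, the sketch below uses toy data and `np.packbits` to show the ratio against a bit-packed representation as well. The arrays here are random placeholders, not real fingerprints or PQ-codes.

```python
import numpy as np

n_rows, n_bits, m = 1000, 1024, 4
rng = np.random.default_rng(0)

fp = rng.integers(0, 2, size=(n_rows, n_bits), dtype=np.uint8)  # one byte per bit
packed = np.packbits(fp, axis=1)                                # eight bits per byte
codes = np.zeros((n_rows, m), dtype=np.uint8)                   # placeholder PQ-codes

ratio_unpacked = fp.nbytes // codes.nbytes
ratio_packed = packed.nbytes // codes.nbytes
# Bit packing is lossless: unpacking recovers the original bits exactly
lossless = bool(np.array_equal(np.unpackbits(packed, axis=1), fp))

print(ratio_unpacked, ratio_packed, lossless)  # 256 32 True
```

Bit packing alone gives an 8x reduction without losing information, while PQ-codes are a further 32x smaller at the cost of being a lossy approximation.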
