# Train and apply PCA dimensionality reduction to the sample data
This notebook walks through the steps needed to train and apply Principal Component Analysis (PCA) dimensionality reduction to the sample data embeddings, in order to reduce the memory footprint of the Milvus data (or of any vector database used to host the embeddings). PCA works by identifying the N orthogonal directions along which the data has the largest variance (i.e. the N least "noisy" directions) and projecting every vector onto those N directions. This allows us to significantly reduce the size of the dataset while maintaining a close approximation of the original vectors. For convenience, we will use the faiss library to train and apply the PCA matrix.
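As a rough illustration of what PCA does, the sketch below fits a projection with plain NumPy on toy data and reports how much of the variance the kept directions retain. The helper names `fit_pca` and `apply_pca` and the toy dimensions (8 reduced to 2) are assumptions made for this example; this is not how faiss implements its PCA internally, only the same idea at small scale.

```python
import numpy as np


def fit_pca(vectors, n_components):
    """Return (mean, components, explained_ratio) for the top principal directions."""
    mean = vectors.mean(axis=0)
    # covariance matrix of the data, shape (d, d)
    cov = np.cov(vectors - mean, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order].T  # shape (n_components, d)
    explained_ratio = eigenvalues[order] / eigenvalues.sum()
    return mean, components, explained_ratio


def apply_pca(vectors, mean, components):
    """Center the vectors and project them onto the principal directions."""
    return (vectors - mean) @ components.T


rng = np.random.default_rng(0)
# toy data: 1000 vectors in 8 dimensions whose variance lives mostly in 2 directions
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 8))
toy_vectors = latent @ mixing + 0.01 * rng.normal(size=(1000, 8))

mean, components, explained_ratio = fit_pca(toy_vectors, n_components=2)
reduced = apply_pca(toy_vectors, mean, components)

print(reduced.shape)  # (1000, 2)
print(f"variance retained: {explained_ratio.sum():.4f}")
```

Because the toy data is built from two latent directions plus a little noise, two components retain almost all of the variance. Real embeddings will generally lose more, so it is worth checking the retained variance for the chosen output size.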

Note: once the PCA matrix has been trained and applied to the embeddings, the same PCA dimensionality reduction must be applied to any search vectors, since the search vectors must have the same dimensions as the indexed vectors.
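To make the note concrete, here is a minimal NumPy sketch in which a query vector is pushed through the same projection as the indexed vectors before a brute-force nearest-neighbour lookup. The sizes (16 reduced to 4, standing in for 1280 reduced to 256), the helper `reduce_dims` and the brute-force search are all assumptions for illustration; in practice the projection would be the trained faiss PCA matrix and the search would run in the vector database.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT = 16, 4  # toy stand-ins for 1280 and 256

indexed = rng.normal(size=(500, D_IN)).astype("float32")

# fit a PCA projection on the indexed vectors via SVD of the centered data
mean = indexed.mean(axis=0)
_, _, vt = np.linalg.svd(indexed - mean, full_matrices=False)
components = vt[:D_OUT]  # shape (D_OUT, D_IN)


def reduce_dims(vectors):
    """Apply the one shared projection to indexed and query vectors alike."""
    return (vectors - mean) @ components.T


reduced_index = reduce_dims(indexed)

# a raw query arrives with D_IN dimensions (here: a slightly perturbed copy of row 42)
query = indexed[42] + 0.001 * rng.normal(size=D_IN).astype("float32")
reduced_query = reduce_dims(query[None, :])  # now shape (1, D_OUT)

# brute-force nearest neighbour in the reduced space
distances = np.linalg.norm(reduced_index - reduced_query, axis=1)
nearest = int(np.argmin(distances))
print(reduced_query.shape, nearest)
```

Comparing the raw 16-dimensional query against the 4-dimensional index would fail on the shape mismatch, which is why the projection has to be part of the query path.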

### Prerequisites:
You will need the sample data generated in the previous notebook: `1_extract_sample_data.ipynb`

**Author:** Leo Thomas - leo@developmentseed.org\
**Last updated:** 2023/06/15

In [1]:
import os
import numpy as np
import faiss
from tqdm.notebook import tqdm

### 1.0. Define data source and collect embedding files

In [2]:
DATA_DIR = os.path.abspath("./one_percent_data_sample")
EMBEDDINGS_DIR = os.path.join(DATA_DIR, "embeddings")

In [3]:
embeddings_files = [
    os.path.join(EMBEDDINGS_DIR, file) 
    for file in os.listdir(EMBEDDINGS_DIR) 
    if os.path.isfile(os.path.join(EMBEDDINGS_DIR, file))
]

### 2.0. Define training sample size
Training the PCA matrix requires holding all of the training vectors in memory. To stay within memory capacity, we will randomly select a subset of the vectors and train the PCA matrix on that subset. It's very important that the vectors be randomly selected, so that they have the same distribution as the full set of vectors to which the PCA dimensionality reduction will be applied.
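The sampling cell in this notebook draws indexes with `np.random.randint`, which samples with replacement, so the same row can appear more than once in the training set. If distinct rows are preferred, one possible alternative (a sketch, not the notebook's method) is `Generator.choice` with `replace=False`. The helper name `sample_rows` and the toy array are hypothetical.

```python
import numpy as np


def sample_rows(embeddings, fraction, rng):
    """Return a random subset of distinct rows (sampling without replacement)."""
    n_rows = len(embeddings)
    n_sample = int(fraction * n_rows)
    indexes = rng.choice(n_rows, size=n_sample, replace=False)
    return embeddings[indexes]


rng = np.random.default_rng(seed=0)  # seeded so the sample is reproducible
toy_embeddings = np.arange(40, dtype="float32").reshape(10, 4)

subset = sample_rows(toy_embeddings, fraction=0.25, rng=rng)
print(subset.shape)  # (2, 4)
```

Seeding the generator makes the training sample reproducible across runs, which helps when comparing PCA matrices trained with different settings.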

In [4]:
# fraction of each file's vectors used to train the PCA matrix
PCA_TRAINING_SAMPLE_SIZE = 0.25

In [5]:
%%time

training_set = []

for embedding_file in tqdm(embeddings_files): 

    embeddings = np.load(embedding_file)
    
    # select a random subset by generating random indexes
    # (note: randint samples with replacement, so a row may be picked more than once)
    rand_indexes = np.random.randint(
        low=0, high=len(embeddings), size=int(PCA_TRAINING_SAMPLE_SIZE * len(embeddings))
    )
    subset = embeddings[rand_indexes]
    
    # use a native Python list to accumulate the training set
    # (since numpy arrays can't be appended to in place)
    training_set.extend(list(subset))

# convert training set to numpy array for compatibility
# with the faiss PCA module
training_set = np.array(training_set)
training_set.shape

  0%|          | 0/374 [00:00<?, ?it/s]

CPU times: user 22.3 s, sys: 2min 28s, total: 2min 50s
Wall time: 4min 34s


(3602912, 1280)
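The shape printed above gives a quick way to estimate how much memory the training set occupies, and what the same number of vectors would occupy after reduction to 256 dimensions. The arithmetic below assumes the embeddings are stored as `float32` (4 bytes per value), which the notebook does not state; other dtypes scale the result accordingly.

```python
import numpy as np

n_vectors, n_dims = 3_602_912, 1280  # training set shape reported above
n_reduced_dims = 256                 # output size used for the PCA matrix
bytes_per_value = np.dtype("float32").itemsize  # assumption: float32 embeddings

training_set_bytes = n_vectors * n_dims * bytes_per_value
reduced_bytes = n_vectors * n_reduced_dims * bytes_per_value

print(f"original: {training_set_bytes / 1e9:.1f} GB")  # original: 18.4 GB
print(f"reduced:  {reduced_bytes / 1e9:.1f} GB")       # reduced:  3.7 GB
print(f"ratio:    {training_set_bytes / reduced_bytes:.1f}x")  # ratio:    5.0x
```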

### 3.0. Train the PCA matrix

In [6]:
%%time
# define the PCA matrix: 1280 input dimensions reduced to 256 output dimensions
mat = faiss.PCAMatrix(1280, 256)

# train
mat.train(training_set)

# write to file
faiss.write_VectorTransform(mat, "1280_to_256_dimensionality_reduction.pca")

CPU times: user 1min 1s, sys: 7min 16s, total: 8min 17s
Wall time: 22min 40s


### 4.0. Apply the trained PCA matrix to the entire embedding set and save

In [11]:
REDUCED_EMBEDDINGS_DIR = os.path.join(DATA_DIR, "reduced_embeddings")
os.makedirs(REDUCED_EMBEDDINGS_DIR, exist_ok=True)

In [12]:
%%time
# read trained PCA matrix from file
pca_matrix = faiss.read_VectorTransform("1280_to_256_dimensionality_reduction.pca")

for embedding_file in tqdm(embeddings_files): 
    
    embeddings = np.load(embedding_file)
    
    # apply the dimensionality reduction using the PCA matrix
    reduced_embeddings = pca_matrix.apply(embeddings)
    
    # write the reduced embeddings to disk, keeping the original file name
    filename = os.path.basename(embedding_file)
    reduced_embeddings_filepath = os.path.join(REDUCED_EMBEDDINGS_DIR, filename)
    np.save(reduced_embeddings_filepath, reduced_embeddings)

  0%|          | 0/374 [00:00<?, ?it/s]

CPU times: user 2min 15s, sys: 55.9 s, total: 3min 11s
Wall time: 1min 41s
