Before running this notebook, use `DataPreprocessor.py` to remove stopwords from the arXiv dataset,
then upload the resulting `stopwords-arxiv-metadata-oai-snapshot.json` file to the VM. This notebook was run
with Google Drive mounted and expects the JSON data to be found at
`'/content/drive/MyDrive/data_resources/stopwords-arxiv-metadata-oai-snapshot.json'`

When generating the embeddings for each article, batches of 250 articles are processed at 
a time in order to manage available memory. Each processed batch is saved 
to `/content/drive/MyDrive/data_resources/batches` with a file name formatted as `batchSize__batchNumber.batch` (for example, `250__0.batch`).
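
As a rough illustration of the batching scheme described above, the sketch below computes which document indices fall into each batch and the file name each batch would be saved under. The helper names `batch_ranges` and `batch_file_name` are hypothetical and do not appear in the notebook; the file name pattern (two underscores between the fields) mirrors the f-string used in the notebook's code.

```python
import math

def batch_ranges(num_documents, batch_size=250):
    """Yield (batch_number, start, end) for each batch; `end` is exclusive."""
    total_batches = math.ceil(num_documents / batch_size)
    for batch_no in range(total_batches):
        start = batch_no * batch_size
        end = min(start + batch_size, num_documents)
        yield batch_no, start, end

def batch_file_name(batch_size, batch_no):
    # Same pattern as the notebook: "<batchSize>__<batchNumber>.batch"
    return f"{batch_size}__{batch_no}.batch"

# 2227430 is the item count reported when the dataset is loaded later on.
ranges = list(batch_ranges(2227430))
last_batch_no, last_start, last_end = ranges[-1]
print(len(ranges), "batches; last file:", batch_file_name(250, last_batch_no))
print("last batch covers documents", last_start, "to", last_end - 1)
```

With a batch size of 250 the final batch is only partly filled, which is why the end index is clamped to the number of documents.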

Once all embeddings are generated, the last cell in this notebook creates a ZIP file
so the embeddings can be downloaded. If you max out the free Google Colab resources, 
you can then continue building the Annoy index with `SentenceEmbeddings_CPU.py`.

If you use the Google Drive integration, make sure you have at least 9 GB of free space available.
If you don't, you can modify the paths above to point to storage local to the Colab environment, 
but those files may be removed if the Colab session times out.
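
A quick way to check free space before starting is sketched below using only the standard library. The helper names `free_gib` and `has_enough_space` are hypothetical, and `"."` is a placeholder path. Note the assumption that the filesystem reports meaningful numbers: for a mounted Google Drive folder the reported value may not match your actual Drive quota, so treat the result as a hint rather than a guarantee.

```python
import shutil

REQUIRED_GIB = 9  # the free-space figure quoted in the text above

def free_gib(path):
    """Return the free space, in GiB, of the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3

def has_enough_space(path, required_gib=REQUIRED_GIB):
    """True if the filesystem holding `path` has at least `required_gib` free."""
    return free_gib(path) >= required_gib

# "." is a placeholder; in Colab you would pass the output directory instead.
print(f"{free_gib('.'):.1f} GiB free, enough for the batches: {has_enough_space('.')}")
```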


In [1]:
! pip install nvidia_smi
! pip install psutil
! pip install nvidia-ml-py3
! pip install gensim
! pip install transformers
! pip install smart_open
! pip install torch
! pip install termcolor
! pip install annoy

Collecting nvidia_smi
  Downloading nvidia_smi-0.1.3-py36-none-any.whl (11 kB)
Collecting sorcery>=0.1.0
  Downloading sorcery-0.2.2-py3-none-any.whl (16 kB)
Collecting pytest>=4.3.1
  Downloading pytest-7.3.1-py3-none-any.whl (320 kB)
     -------------------------------------- 320.5/320.5 kB 9.7 MB/s eta 0:00:00
Collecting iniconfig
  Downloading iniconfig-2.0.0-py3-none-any.whl (5.9 kB)
Collecting pluggy<2.0,>=0.12
  Downloading pluggy-1.0.0-py2.py3-none-any.whl (13 kB)
Collecting littleutils>=0.2.1
  Downloading littleutils-0.2.2.tar.gz (6.6 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting wrapt
  Downloading wrapt-1.15.0-cp311-cp311-win_amd64.whl (36 kB)
Installing collected packages: littleutils, wrapt, pluggy, iniconfig, sorcery, pytest, nvidia_smi
  Running setup.py install for littleutils: started
  Running setup.py install for littleutils: finished with status 'done'
Successfully installed iniconfig-2.0.0 lit

  DEPRECATION: littleutils is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559

[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: python.exe -m pip install --upgrade pip


^C





Collecting nvidia-ml-py3
  Downloading nvidia-ml-py3-7.352.0.tar.gz (19 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Installing collected packages: nvidia-ml-py3
  Running setup.py install for nvidia-ml-py3: started
  Running setup.py install for nvidia-ml-py3: finished with status 'done'
Successfully installed nvidia-ml-py3-7.352.0


  DEPRECATION: nvidia-ml-py3 is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559



Print CPU, memory, and GPU (if available) usage metrics.


In [None]:

import tensorflow as tf
import math
import nvidia_smi
info_gpus = tf.config.list_physical_devices('GPU')
if len(info_gpus) > 0:
    nvidia_smi.nvmlInit()

    device_count = nvidia_smi.nvmlDeviceGetCount()
    for i in range(device_count):
      handle = nvidia_smi.nvmlDeviceGetHandleByIndex(i)
      info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
      print(f"Device {i}: {nvidia_smi.nvmlDeviceGetName(handle).decode()}")
      print(f"Memory : {round(100*info.free/info.total,2)}% free: {info.total}(total), {info.free} (free), {info.used} (used)")
    
    nvidia_smi.nvmlShutdown()
else:
  print("No GPU used")

No GPU used


In [None]:
import psutil
split_bar = '='*20
memory_info = psutil.virtual_memory()._asdict()
print(f"{split_bar} Memory Usage {split_bar}")
for k,v in memory_info.items():
  print(k, v)
print(f"{split_bar} CPU Usage {split_bar}")
print(f"CPU percent: {psutil.cpu_percent()}%")

total 13616324608
available 12115271680
percent 11.0
used 1159946240
free 10664103936
active 462086144
inactive 2251771904
buffers 142462976
cached 1649811456
shared 13709312
slab 163209216
CPU percent: 26.8%


Check whether GPU resources are available to you in the current Google Colab environment.

In [None]:
import torch
# set to True to use the gpu (if there is one available)
use_gpu = True

# select device
device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
print(f'device: {device.type}')
	

device: cpu


In [None]:
import datetime
import json
import os
from typing import List
import smart_open
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F
from termcolor import colored
from annoy import AnnoyIndex
import gensim.utils  # openFile calls gensim.utils.simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords
class NearestNeighborSearcher:
  """
  Class that creates the vector of the documents, and builds an annoy index, parses the query
  and returns resulting documents relevant to the query.
  """
  def __init__(self,batchSize,  index_name):
    """
    Initializes the NearestNeighborSearcher object with a pre-trained transformer model.
    """
    self.annoy_index = None
    self.index_name = index_name
    self.tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
    self.model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
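    # Use the first CUDA device when one is available, otherwise fall back to the CPU.
    self.device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    self.model = self.model.to(self.device)
    self.vector_length = self.model.config.hidden_size  # Annoy vectors must have the same length as the model's embeddings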
    self.batchSize = batchSize
    self.failedBatches =[]
    self.batchOutDir ='/content/drive/MyDrive/data_resources/batches'
    

  def mean_pooling(self, model_output, attention_mask):
    """
    Mean-pool the token embeddings, weighted by the attention mask so padding is ignored,
    into one vector per document so we can index them with Annoy.
    Sourced from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
    """
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
	
  def _get_vectors(self, documents: List[str]) -> torch.Tensor:
    """
    Tokenizes a list of documents, passes them through the transformer model, and returns their vectors.
    :param documents: List[str] - A list of documents to convert to vectors
    :return: torch.Tensor - One L2-normalized embedding per document, shape (len(documents), hidden_size).
    """
    tokens = self.tokenizer(documents,padding=True, truncation=True, return_tensors='pt').to(self.device)
    with torch.no_grad():
      vectors = self.model(**tokens)
    # Shape of the last hidden state: (documents, tokens, embedding)
    sentence_embeddings = self.mean_pooling(vectors, tokens['attention_mask'])
    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings
  def build_index(self, documents: List[str],totalBatches = None):
    """
    Builds an Annoy index based on the vectors of a list of documents.
    :param documents: List[str] - A list of documents to index.
    """        
    self.annoy_index = AnnoyIndex(self.vector_length, "angular")  # angular is the metric.
    self.processDocsInBatches(documents,totalBatches)
  def processDocsInBatches(self, documents, totalBatches = None):
    """
    Process Embedding files saved to disk and load them into an Annoy Index
    If totalBatches is specified, will only process that number of batches 
    to facilitate debugging
    """ 
    batchOutDir =  self.batchOutDir
    if totalBatches == None:
        totalBatches = int((len(documents) / self.batchSize ) + 1    )

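    if os.path.exists(self.index_name):
        # Annoy cannot add items to, or rebuild, an index that was loaded from disk,
        # so reuse the saved index as-is.
        self.annoy_index.load(self.index_name)
        return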
    for batchNo in range(0, totalBatches):
        batchFile = f'{batchOutDir}/{self.batchSize}__{batchNo}.batch'
        if(os.path.exists(batchFile) == False):
            print(colored(f'batch {batchNo+1} not found. Skipping', 'red'))
            continue
        print(f"Processing batch {batchNo+1} of {int(totalBatches)}. ||   " + datetime.datetime.now().isoformat())
        batchStart = batchNo*self.batchSize
        try:
          vectors = torch.load(batchFile,map_location=torch.device('cpu'))
          for i, embedding in enumerate(vectors) :
              if(i ==0):
                print(f'annoy index item starts at {batchStart+i}')
              self.annoy_index.add_item(batchStart+i, embedding)
          print(f"Finished Processing batch {batchNo+1} of {int(totalBatches)}. ||   " + datetime.datetime.now().isoformat())
        except Exception as e: 
          print(colored(f"Failed Processing batch {batchNo+1} of {int(totalBatches)} ||   " + datetime.datetime.now().isoformat(), 'red'))
          print(e)
          self.failedBatches.append(batchNo)
    self.annoy_index.build(totalBatches) 
    print(f"saving index to {self.index_name}")
    self.annoy_index.save(self.index_name)
  def processTokensToEmbeddings(self, documents, totalBatches):
    """
    Convert a list of documents to normalized, mean-pooled embedding vectors in batches and save the 
    generated vectors to disk for later indexing in the event of a crash. If a batch
    has already been saved to disk, that batch will be skipped.
    """
    batchOutDir = self.batchOutDir
    if totalBatches == None:
      totalBatches = int((len(documents) / self.batchSize ) + 1    )

    for batchNo in range(0, totalBatches):
      batchFile = f'{batchOutDir}/{self.batchSize}__{batchNo}.batch'
      if(os.path.exists(batchFile)):
        print(f'batch {batchNo+1} already processed. Skipping')
        continue
      batchStart = batchNo*self.batchSize
      if(batchStart >= len(documents)):
        print(f"batch {batchNo+1} starts past the end of the dataset. Stopping")
        break
      # batchEnd is the inclusive index of the last document in this batch
      batchEnd = min(((batchNo+1) * self.batchSize), len(documents)) -1
      try:
        print(f"Processing batch {batchNo+1} of {int(totalBatches)}. Range {batchStart} to {batchEnd} ||   " + datetime.datetime.now().isoformat())
        # Slice ends are exclusive, so add 1 to include the document at batchEnd
        batch = documents[batchStart:batchEnd + 1]
        vectors = self._get_vectors(batch)  # Convert docs to vectors, to represented in vector space.
        print(f"Writing batch {batchNo+1} to {batchFile} ||   " + datetime.datetime.now().isoformat())
        # torch.save creates the batch file itself; no separate open() call is needed
        torch.save(vectors,batchFile)
        print(f"Finished Processing batch {batchNo+1} of {int(totalBatches)}. Range {batchStart} to {batchEnd} ||   " + datetime.datetime.now().isoformat())
      except Exception as e: 
        print(colored(f"Failed Processing batch {batchNo+1} of {int(totalBatches)}. Range {batchStart} to {batchEnd} ||   " + datetime.datetime.now().isoformat(), 'red'))
        print(e)
        self.failedBatches.append(batchNo)
    
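  def openFile(self, filePath, isPreProcessed=True, isTokenized = False):
      """
      Yield documents from filePath. Pre-processed files are read as a single JSON array;
      otherwise the file is read line by line and stopwords are removed from each line.
      """
      with smart_open.open(filePath, encoding="utf-8") as f:
          if isPreProcessed:
              jsonData = json.load(f)
              for rawLine in jsonData:
                  if not isTokenized:
                      tokens = gensim.utils.simple_preprocess(rawLine)
                  else:
                      tokens = rawLine
                  yield tokens
          else:
              # Read the raw file directly; it must not be consumed by json.load first
              for rawLine in f:
                  yield remove_stopwords(rawLine)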

print(f'last updated {datetime.datetime.now().isoformat()}')

last updated 2023-04-14T20:18:25.425798
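
To make the pooling step in `mean_pooling` and `_get_vectors` more concrete, here is a small sketch of the same arithmetic on plain Python lists for a single document. It is an illustration only, not the notebook's implementation (which operates on batched torch tensors); the function names `mean_pool` and `l2_normalize` and the toy numbers are made up for this example.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one document, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 0 marks a padding token
            for j, value in enumerate(vector):
                summed[j] += value
    # Guard against division by zero, like torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length, as F.normalize(p=2) does per row."""
    norm = max(math.sqrt(sum(v * v for v in vector)), eps)
    return [v / norm for v in vector]

# Three token vectors; the last one is padding and must not affect the result.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled, unit)
```

Because the vectors are normalized to unit length, the angular distance used by the Annoy index depends only on the direction of each embedding.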


Create the class object for processing the arXiv dataset. 
Note that you may run into memory issues if you increase `batchSize` beyond 250.

In [None]:
tokenizedDataFile = '/content/drive/MyDrive/data_resources/stopwords-arxiv-metadata-oai-snapshot.json'
nns = NearestNeighborSearcher(batchSize=250, index_name="/content/drive/MyDrive/data_resources/arxiv_transformer_index.bin")
print(f'last updated {datetime.datetime.now().isoformat()}')

last updated 2023-04-14T20:18:28.538287


Load the arXiv dataset for processing.

In [None]:
print("loading dataset from file - "  + datetime.datetime.now().isoformat())
documents = list(nns.openFile(tokenizedDataFile,isTokenized=True))    
print("finished loading dataset from file - "  + datetime.datetime.now().isoformat())
print("loaded " + str(len(documents))+ " items")

loading dataset from file - 2023-04-14T20:18:28.582642
finished loading dataset from file - 2023-04-14T20:19:06.292463
loaded 2227430 items


Process the arXiv dataset and generate the mean-pooled sentence embeddings. The 
embeddings are saved in batches so we don't have to re-process a batch if we get
disconnected from Colab. If `totalBatches` is set to `None`, the entire
dataset is processed; set it to a positive integer to process only batches 0 through `totalBatches - 1`. 
If a batch has already been saved to the file system, it will not be processed 
again unless you delete that saved batch file.
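
The skip-if-already-saved behaviour described above can be sketched with the standard library alone. The helper `pending_batches` is hypothetical (the notebook performs this check inline inside `processTokensToEmbeddings`), and the temporary directory stands in for the real batch folder.

```python
import os
import tempfile

def pending_batches(batch_dir, batch_size, total_batches):
    """Return the batch numbers that have no saved file in `batch_dir` yet."""
    return [
        batch_no
        for batch_no in range(total_batches)
        if not os.path.exists(os.path.join(batch_dir, f"{batch_size}__{batch_no}.batch"))
    ]

# Simulate a run that was interrupted after saving batches 0 and 2.
with tempfile.TemporaryDirectory() as batch_dir:
    for done in (0, 2):
        open(os.path.join(batch_dir, f"250__{done}.batch"), "w").close()
    remaining = pending_batches(batch_dir, 250, total_batches=4)

print("batches still to process:", remaining)
```

Deleting a batch file makes its number show up in the pending list again, which matches the re-processing rule stated above.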


In [None]:
nns.processTokensToEmbeddings(documents, totalBatches= None) 
## Print the batch number of any batches that we may have failed to process 
print(nns.failedBatches)

In [None]:
# Colab does not have enough memory to build the index; download the batches and build it locally instead
#nns.build_index(documents, totalBatches = None)

Create and download a ZIP file containing the embedding batches
for processing outside of Colab.

In [None]:

!zip -r /content/batches.zip /content/drive/MyDrive/data_resources/batches
from google.colab import files
files.download("/content/batches.zip")
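
If the `!zip` shell command is not available in your environment, a similar archive can be produced with Python's standard library. This is a minimal sketch under assumptions: `zip_directory` is a hypothetical helper, and temporary directories stand in for the real batch folder and output location. Unlike `zip -r` with an absolute path, the archive created here stores paths relative to the batch folder.

```python
import os
import shutil
import tempfile
import zipfile

def zip_directory(source_dir, archive_base):
    """Create `<archive_base>.zip` from the contents of `source_dir`; return its path."""
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)

# Demonstration with throwaway directories instead of the Drive paths.
with tempfile.TemporaryDirectory() as batch_dir, tempfile.TemporaryDirectory() as out_dir:
    with open(os.path.join(batch_dir, "250__0.batch"), "w") as handle:
        handle.write("placeholder batch contents")
    archive_path = zip_directory(batch_dir, os.path.join(out_dir, "batches"))
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(os.path.basename(archive_path), archived_names)
```

The archive is written to a different directory than the one being zipped so that it does not end up inside itself.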