## PAPER EMBEDDER
This notebook downloads the full **arXiv metadata dataset**, generates dense vector embeddings for the paper abstracts with a **Sentence Transformer** model, and saves the results in a single HDF5 file for later use in semantic search or recommendation systems.

### 1. Initialization and Setup
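This section clones the required GitHub repository *(EmbedX)*, installs its dependencies, and installs the package itself from inside the project directory. The `cd` in the last command applies only to that one shell invocation, so the notebook's working directory does not change.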

In [None]:
!git clone https://github.com/huynguyen6906/EmbedX.git
!pip install -r EmbedX/requirements.txt
!cd EmbedX && pip install .

Cloning into 'EmbedX'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 71 (delta 40), reused 49 (delta 18), pack-reused 0 (from 0)
Receiving objects: 100% (71/71), 9.51 KiB | 9.51 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Collecting numpy (from -r EmbedX/requirements.txt (line 2))
  Using cached numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting h5py (from -r EmbedX/requirements.txt (line 3))
  Using cached h5py-3.15.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting tqdm (from -r EmbedX/requirements.txt (line 4))
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting PyMuPDF (from -r EmbedX/requirements.txt (line 5))
  Using cached pymupdf-1.26.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting sentence-transformers (from -r EmbedX/requirements.

In [1]:
import json
import os

import gdown
import h5py
import numpy as np
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer

### 2. Data Preparation
This code block checks for the existence of the large **arXiv metadata snapshot** file. If the file is not found, it is downloaded from a specified Google Drive ID into the local `.cache` directory using `gdown`.

In [None]:
# Ensure the local directory for caching files exists.
os.makedirs(".cache", exist_ok=True)

# Check if the main arXiv metadata file is already downloaded in the cache.
if not os.path.isfile(".cache/arxiv-metadata-oai-snapshot.json"):
    # If not present, download the snapshot file using its Google Drive ID.
    gdown.download(id='14QlvPBOCZVLKiqIZ6_7-7pP2lkS8Zd6Z', output='.cache/arxiv-metadata-oai-snapshot.json', quiet=False)

### 3. Helper Function: Merge HDF5 Files
The `merge_HDF5_files` function combines the data generated from multiple processing chunks into a single HDF5 file. It takes the structure of the output (the dtypes of the `urls` and `embeddings` datasets and the embedding dimension) from the first input file that exists, then resizes the output datasets to append each file's records in turn.

In [2]:
def merge_HDF5_files(input_list, output_file):
    """
    Merges data (URLs and embeddings) from multiple HDF5 files into a single output file.
    """
    if not input_list:
        print("❌ Error: Input file list is empty.")
        return

    total_records = 0

    # 1. Initialize Output Structure based on the first valid file
    first_file = None
    # Find the first existing file to determine the required structure (dtype, shape).
    for f_path in input_list:
        if os.path.exists(f_path):
            first_file = f_path
            break

    if not first_file:
        print("❌ Error: No valid input files found.")
        return

    with h5py.File(first_file, 'r') as f_first:
        # Read the embedding dimension and data types from the dataset metadata (no data is loaded).
        # Taking the last axis tolerates an extra singleton axis (e.g., shape (N, 1, D))
        # and still works when the file holds a single record.
        embed_shape = f_first['embeddings'].shape[-1]
        embed_dtype = f_first['embeddings'].dtype
        url_dtype = f_first['urls'].dtype

    # Create the output file and initialize extendable datasets.
    with h5py.File(output_file, 'w') as f_output:
        # Initialize 'urls' dataset with zero length, maxshape=(None,) allows extension.
        f_output.create_dataset(
            'urls',
            shape=(0,),
            maxshape=(None,),
            dtype=url_dtype,
            chunks=True
        )
        # Initialize 'embeddings' dataset with zero length, maxshape=(None, embed_shape) allows extension.
        f_output.create_dataset(
            'embeddings',
            shape=(0, embed_shape),
            maxshape=(None, embed_shape),
            dtype=embed_dtype,
            chunks=True
        )

    pbar = tqdm(total = len(input_list), desc="Merging")

    # 2. Loop through input files and append data
    # Open the output file in append mode ('a') for modification.
    with h5py.File(output_file, 'a') as f_output:
        for file_path in input_list:
            if not os.path.exists(file_path):
                print(f"⚠️ File not found: {file_path}. Skipping.")
                pbar.update(1)
                continue

            try:
                with h5py.File(file_path, 'r') as f_input:
                    # Read data from the current input file.
                    current_urls = f_input['urls'][:]
                    num_records = current_urls.shape[0]

                    if num_records == 0:
                        # Nothing to append; still advance the progress bar.
                        pbar.update(1)
                        continue

                    # Reshape to (N, D). Unlike np.squeeze, this keeps a single-record file 2-D.
                    current_embeddings = f_input['embeddings'][:].reshape(num_records, -1)

                    dset_urls = f_output['urls']
                    dset_embeddings = f_output['embeddings']

                    new_size = total_records + num_records

                    # Resize the datasets in the output file to accommodate new records.
                    dset_urls.resize(new_size, axis=0)
                    dset_embeddings.resize(new_size, axis=0)

                    # Write the current file's data into the newly reserved space.
                    dset_urls[total_records:new_size] = current_urls
                    dset_embeddings[total_records:new_size] = current_embeddings

                    # Update the running total of merged records.
                    total_records = new_size

            except Exception as e:
                # Handle potential errors during file reading or resizing.
                print(f"❌ Error processing file {file_path}: {e}. Skipping this file.")

            pbar.update(1)
    pbar.close()
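
The resize-and-append pattern used by `merge_HDF5_files` can be tried in isolation. The following is a minimal sketch, assuming `h5py` and NumPy are installed; the helper name `append_rows`, the file name `demo.h5`, and the 4-dimensional toy rows are illustrative and not part of the notebook. The file uses h5py's in-memory `core` driver, so nothing is written to disk.

```python
import h5py
import numpy as np


def append_rows(dset, rows):
    """Grow a resizable dataset along axis 0 and write `rows` at the end."""
    old_size = dset.shape[0]
    new_size = old_size + len(rows)
    dset.resize(new_size, axis=0)
    dset[old_size:new_size] = rows
    return new_size


# In-memory file used only for this demonstration (hypothetical name).
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    # maxshape=(None, 4) makes axis 0 extendable; chunks=True is required for that.
    emb = f.create_dataset(
        "embeddings", shape=(0, 4), maxshape=(None, 4),
        dtype="float32", chunks=True,
    )
    append_rows(emb, np.ones((3, 4), dtype="float32"))
    total = append_rows(emb, np.zeros((2, 4), dtype="float32"))
    final_shape = emb.shape
    first_row_sum = float(emb[0].sum())
    last_row_sum = float(emb[total - 1].sum())
```

Because the dataset starts with zero rows and grows per input file, the total record count does not need to be known before merging begins.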

### 4. Main Execution: Abstract Embedding, Chunking, and Cleanup
This block executes the core logic. It reads the metadata file chunk-by-chunk, loads the specified model to encode abstracts, applies L2-normalization, saves the chunk results, and finally merges all files and cleans up temporary data.

**Model Used for Embeddings**

The abstracts are encoded using the `all-roberta-large-v1` model from the `sentence-transformers` library, which maps each text to a 1024-dimensional vector suited to semantic similarity comparisons.
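
The reason for L2-normalizing the embeddings is that the dot product of two unit vectors equals their cosine similarity. A small sketch with toy 2-D vectors (not real embeddings) illustrates this; the helper `l2_normalize` and its `eps` guard against all-zero rows are assumptions added for the example.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; eps avoids division by zero."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Toy vectors standing in for abstract embeddings.
a = np.array([[3.0, 4.0]])
b = np.array([[4.0, 3.0]])

# Cosine similarity computed from the definition.
cosine = float((a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b)))

# Plain dot product after normalization gives the same value.
dot_of_unit = float((l2_normalize(a) @ l2_normalize(b).T).item())

# An all-zero row stays zero instead of producing NaN.
row_norms = np.linalg.norm(l2_normalize(np.array([[3.0, 4.0], [0.0, 0.0]])), axis=1)
```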

In [5]:
start = 0
end = 100
chunk = 10
file_path = '.cache/arxiv-metadata-oai-snapshot.json'
model = SentenceTransformer('all-roberta-large-v1')

In [6]:
# Ensure the output directory for the embedded paper data exists.
os.makedirs("OUTPUT", exist_ok=True)

# Progress bar counting the papers embedded across all chunks.
pbar = tqdm(total=end - start, desc="Embedding: ")

# Main loop: Iterate over the specified range (start to end) in CHUNK increments.
for x in range(start, end, chunk):
    papers_list = [] 
    START = x
    END = min(x + chunk, end) 
    
    valid_line_index = 0 
    
    # 1. READ AND CHUNK METADATA FILE
    # Open the large metadata file (e.g., arxiv-metadata-oai-snapshot.json).
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Skip empty lines.
            if not line.strip():
                continue
                
            # Stop reading once the end index of the current chunk is reached.
            if valid_line_index >= END:
                break
                
            # Process lines that fall within the current chunk's start index.
            if valid_line_index >= START:
                try:
                    # Parse the JSON data from the current line (assuming JSONL format).
                    paper_data = json.loads(line)
                    papers_list.append(paper_data)
                except json.JSONDecodeError:
                    # Ignore lines that are not valid JSON.
                    pass
            
            # Count every non-empty line, including any that failed to parse.
            valid_line_index += 1

    # 2. PREPARE DATA FOR EMBEDDING
    local_indices = range(len(papers_list))     
    # Generate the full arXiv PDF URLs using the paper 'id'.
    final_urls = [f"https://arxiv.org/pdf/{papers_list[i]['id']}.pdf" for i in local_indices]
    # Extract the abstract text to be encoded.
    texts_to_encode = [papers_list[i]['abstract'] for i in local_indices]
    
    # 3. GENERATE EMBEDDINGS
    try:
        # Use the loaded Sentence Transformer model to convert abstracts into dense vectors.
        all_embeddings_numpy = model.encode(
            texts_to_encode,
            show_progress_bar=True,
            convert_to_numpy=True,
            batch_size=32  # Lower this value if memory is tight.
        )
    except NameError:
        print("\n*** ERROR: 'model' variable is undefined/not loaded. Cannot encode. ***")
        break
        
    # Update the external progress bar with the number of texts encoded in this chunk.
    pbar.update(len(texts_to_encode))
    
    # L2-normalize the embeddings so that a dot product equals cosine similarity.
    norms = np.linalg.norm(all_embeddings_numpy, axis=1, keepdims=True)
    # Guard against division by zero in case a vector is all zeros.
    all_embeddings_numpy = all_embeddings_numpy / np.maximum(norms, 1e-12)

    # 4. WRITE HDF5 CHUNK FILE
    # Define the output path for the current chunk.
    output_path = f"OUTPUT/Papers_Embedded_{int(x/chunk)}.h5"
    
    # Write the URLs and embeddings to the HDF5 file.
    with h5py.File(output_path, "w") as outfile:
        # Store URLs as fixed-length byte strings (NumPy 'S' dtype), which h5py writes directly.
        outfile.create_dataset("urls", data=np.array(final_urls, dtype='S')) 
        outfile.create_dataset("embeddings", data=all_embeddings_numpy)
    
    # Free up memory used by the papers list before the next iteration.
    del papers_list 

pbar.close()

Embedding:   0%|          | 0/100 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
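
The loop above reopens the metadata file and scans it from the first line for every chunk, which becomes slow when `start` and `end` cover a large part of the snapshot. One alternative is to stream the file once and slice it into chunks with `itertools.islice`. The sketch below is an illustration under stated assumptions, not the notebook's method: `iter_chunks` is a hypothetical helper, and the sample records are placeholders rather than real arXiv entries. As in the loop above, indices count non-empty lines, and lines that fail to parse are skipped.

```python
import io
import json
from itertools import islice


def iter_chunks(lines, start, end, chunk_size):
    """Yield lists of parsed records for non-empty lines start..end-1."""
    non_empty = (line for line in lines if line.strip())
    window = islice(non_empty, start, end)
    while True:
        batch = list(islice(window, chunk_size))
        if not batch:
            return
        records = []
        for line in batch:
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # Skip malformed lines.
        yield records


# Placeholder JSONL content; in the notebook this would be the opened snapshot file.
sample = io.StringIO(
    '{"id": "a", "abstract": "first"}\n'
    "\n"
    '{"id": "b", "abstract": "second"}\n'
    "not json\n"
    '{"id": "c", "abstract": "third"}\n'
)

chunks = list(iter_chunks(sample, start=0, end=5, chunk_size=2))
ids = [[record["id"] for record in chunk] for chunk in chunks]
```

Each yielded list could then be encoded and written to its own HDF5 chunk file, so the snapshot is read only once regardless of the number of chunks.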

In [7]:
# 1. MERGE CHUNKED HDF5 FILES
# Generate a list of all HDF5 chunk files created during the embedding process.
file_chunks = [
    f"OUTPUT/Papers_Embedded_{int(x/chunk)}.h5"
    for x in range(start, end, chunk)
]

# Define the final, consolidated HDF5 file path.
file_gop_cuoi = f"OUTPUT/Papers_Embedded_{start}-{end}.h5"

# Call the function to merge all chunk files into the single final file.
merge_HDF5_files(file_chunks, file_gop_cuoi)

# 2. CLEANUP: REMOVE TEMPORARY CHUNKS
# Iterate through the indices used to generate the chunk files.
for x in range(start, end, chunk):
    chunk_path = f"OUTPUT/Papers_Embedded_{int(x/chunk)}.h5"
    try:
        # Delete the individual temporary HDF5 chunk file to free up disk space.
        os.remove(chunk_path)
    except FileNotFoundError:
        # Handle case where the file might have been skipped or already deleted.
        print(f"⚠️ Warning: Chunk file not found during cleanup: {chunk_path}")
        continue

Merging:   0%|          | 0/10 [00:00<?, ?it/s]
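
Once merged, the `urls` and `embeddings` datasets can serve the semantic search use case mentioned at the top of the notebook. Because the stored vectors are L2-normalized, ranking by dot product is equivalent to ranking by cosine similarity. The sketch below is a hedged illustration: `top_k` is a hypothetical helper, and the toy 2-D vectors and `example.org` URLs stand in for arrays that would normally be loaded from the merged file (for example with `f['embeddings'][:]` and `f['urls'][:]`). The URLs are byte strings, as written by the notebook, so they are decoded before being returned.

```python
import numpy as np


def top_k(query, embeddings, urls, k=3):
    """Return the k (url, score) pairs with the highest dot product to `query`."""
    scores = embeddings @ query
    order = np.argsort(-scores)[:k]
    return [(urls[i].decode("utf-8"), float(scores[i])) for i in order]


# Toy unit vectors and placeholder URLs standing in for the merged datasets.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
urls = np.array(
    ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    dtype="S",
)

# In practice the query would be an L2-normalized embedding of the search text.
query = np.array([0.0, 1.0])
results = top_k(query, embeddings, urls, k=2)
```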