## PAPER EMBEDDER
This notebook downloads the full **arXiv metadata dataset**, generates dense vector embeddings for the paper abstracts with a **Sentence Transformer** model, and saves the results to a single HDF5 file for later use in semantic search or recommendation systems.

### 1. Initialization and Setup
This section clones the required GitHub repository *(EmbedX)*, installs its dependencies, changes into the project directory, and imports the libraries used in the rest of the notebook.

In [None]:
!git clone https://github.com/huynguyen6906/EmbedX.git
%pip install -r EmbedX/requirements.txt
%cd EmbedX
%pip install .

In [None]:
import json
import h5py
import gdown
import os
import numpy as np
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer

### 2. Data Preparation
This code block checks for the existence of the large **arXiv metadata snapshot** file. If the file is not found, it is downloaded from a specified Google Drive ID into the local `.cache` directory using `gdown`.

In [4]:
os.makedirs(".cache", exist_ok=True)
if not os.path.isfile(".cache/arxiv-metadata-oai-snapshot.json"):
    gdown.download(id='14QlvPBOCZVLKiqIZ6_7-7pP2lkS8Zd6Z', output='.cache/arxiv-metadata-oai-snapshot.json', quiet=False)

### 3. Helper Function: Merge HDF5 Files
The `merge_HDF5_files` function combines the HDF5 files produced for individual chunks into one file. It takes the embedding dimension and dtypes from the first input file that exists, creates resizable `urls` and `embeddings` datasets in the output file, and appends each chunk by resizing those datasets. Missing or unreadable files are reported and skipped.
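
The append-by-resize pattern can also be tried in isolation. The following is a minimal sketch rather than the notebook's function: the file name `demo_merge.h5`, the dimension `dim = 4`, and the random batches are assumptions made purely for illustration. It uses only `h5py`'s `create_dataset` (with `maxshape` and `chunks`) and `Dataset.resize`.

```python
import os
import tempfile

import h5py
import numpy as np

dim = 4  # hypothetical embedding dimension, for illustration only
batches = [
    np.random.rand(3, dim).astype("float32"),
    np.random.rand(2, dim).astype("float32"),
]

# Hypothetical output file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo_merge.h5")

with h5py.File(path, "w") as f:
    # maxshape=(None, dim) lets the first axis grow; resizable datasets must be chunked.
    dset = f.create_dataset(
        "embeddings",
        shape=(0, dim),
        maxshape=(None, dim),
        dtype="float32",
        chunks=True,
    )
    written = 0
    for batch in batches:
        new_size = written + batch.shape[0]
        dset.resize(new_size, axis=0)   # grow the dataset along axis 0
        dset[written:new_size] = batch  # fill the newly added rows
        written = new_size

with h5py.File(path, "r") as f:
    merged_shape = f["embeddings"].shape
    merged = f["embeddings"][:]

print(merged_shape)  # (5, 4)
```

Growing the dataset one batch at a time keeps memory use bounded by the largest batch, at the cost of one resize per input file.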

In [5]:
def merge_HDF5_files(input_list, output_file):
    """
    Merge data from multiple HDF5 files that share the same structure (urls, embeddings).
    """
    if not input_list:
        print("❌ Error: Input file list is empty.")
        return

    total_records = 0

    # 1. Use the first existing file to determine the structure and initialize the datasets
    first_file = None
    for f_path in input_list:
        if os.path.exists(f_path):
            first_file = f_path
            break

    if not first_file:
        print("❌ Error: No valid input files found.")
        return

    with h5py.File(first_file, 'r') as f_first:
        # Read the dtypes and the embedding dimension from the dataset metadata.
        # shape[-1] avoids loading the data and also works for a single-row file.
        embed_shape = f_first['embeddings'].shape[-1]
        embed_dtype = f_first['embeddings'].dtype
        url_dtype = f_first['urls'].dtype

    with h5py.File(output_file, 'w') as f_output:
        f_output.create_dataset(
            'urls',
            shape=(0,),
            maxshape=(None,),
            dtype=url_dtype,
            chunks=True
        )
        f_output.create_dataset(
            'embeddings',
            shape=(0, embed_shape),
            maxshape=(None, embed_shape),
            dtype=embed_dtype,
            chunks=True
        )

    pbar = tqdm(total=len(input_list), desc="Merging")

    # 2. Loop over the inputs and append their data to the output file (opened in 'a' mode)
    with h5py.File(output_file, 'a') as f_output:
        for file_path in input_list:
            if not os.path.exists(file_path):
                print(f"⚠️ File not found: {file_path}")
                pbar.update(1)  # count skipped files so the bar reaches its total
                continue

            try:
                with h5py.File(file_path, 'r') as f_input:
                    current_urls = f_input['urls'][:]
                    current_embeddings = f_input['embeddings'][:]
                    current_embeddings = np.squeeze(current_embeddings)
                    num_records = current_urls.shape[0]

                    if num_records == 0:
                        pbar.update(1)  # count empty files so the bar reaches its total
                        continue

                    dset_urls = f_output['urls']
                    dset_embeddings = f_output['embeddings']

                    new_size = total_records + num_records

                    # Resize the datasets in the output file
                    dset_urls.resize(new_size, axis=0)
                    dset_embeddings.resize(new_size, axis=0)

                    # Write the data into the newly added rows
                    dset_urls[total_records:new_size] = current_urls
                    dset_embeddings[total_records:new_size] = current_embeddings

                    # Update the running record count
                    total_records = new_size

            except Exception as e:
                print(f"❌ Error processing file {file_path}: {e}. Skipping this file.")

            pbar.update(1)
    pbar.close()

### 4. Main Execution: Abstract Embedding, Chunking, and Cleanup
This block runs the pipeline. It loads the model, reads the metadata file in chunks of 1,000 records, encodes the abstracts, L2-normalizes the embeddings, and saves each chunk to its own HDF5 file. It then merges the chunk files into one file and deletes the temporary chunk files.
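
Chunked reading of a JSON Lines file can also be done in a single pass over the file. The sketch below is an illustration under assumptions, not the notebook's code: `iter_json_chunks` is a hypothetical helper, and the demo file with its `id` and `abstract` values is made up. Like the cell below, it counts only non-blank lines and drops lines that fail to parse.

```python
import itertools
import json
import os
import tempfile


def iter_json_chunks(path, chunk_size, start=0, end=None):
    """Yield lists of parsed records, reading the file once.

    Hypothetical helper: `start` and `end` are indices over non-blank lines,
    and lines that are not valid JSON are dropped from their chunk.
    """
    with open(path, "r", encoding="utf-8") as f:
        non_blank = (line for line in f if line.strip())
        window = itertools.islice(non_blank, start, end)
        while True:
            lines = list(itertools.islice(window, chunk_size))
            if not lines:
                break
            records = []
            for line in lines:
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    continue  # skip malformed lines
            yield records


# Build a small demo file (made-up records) and read it back in chunks of 2.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    for i in range(5):
        record = {"id": f"0000.{i:04d}", "abstract": f"abstract {i}"}
        f.write(json.dumps(record) + "\n")

chunks = list(iter_json_chunks(demo_path, chunk_size=2))
print([len(c) for c in chunks])  # [2, 2, 1]
```

Because the file handle stays open across chunks, each line is read once, which matters for a snapshot with millions of records.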

**Model Used for Embeddings**

The abstracts are encoded with the `all-roberta-large-v1` model from the `sentence-transformers` library, which maps each text to a 1024-dimensional vector and is intended for semantic similarity tasks.
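
The embeddings are L2-normalized before they are saved, so the dot product of two stored vectors equals their cosine similarity. A minimal sketch of that property follows, using made-up 2-D vectors in place of real model output (the values are assumptions for illustration only).

```python
import numpy as np

# Hypothetical toy "embeddings"; real ones would come from model.encode(...)
embeddings = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])

# Divide each row by its Euclidean length, as the main cell does
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / norms

# For unit vectors, the matrix of dot products is the cosine-similarity matrix
cosine = unit @ unit.T
print(np.round(cosine, 2))
```

Storing unit vectors means a later search step only needs a matrix product and does not have to recompute norms per query.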

In [None]:
start = 0
end = 2866787
chunk = 1000
file_path = '.cache/arxiv-metadata-oai-snapshot.json'
model = SentenceTransformer('all-roberta-large-v1')
pbar = tqdm(total=end - start, desc="Embedding: ")

os.makedirs("PAPERS", exist_ok=True)

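for x in range(start, end, chunk):
    papers_list = []
    # Half-open range [START, END) of non-blank line indices for this chunk
    START = x
    END = min(x + chunk, end)

    # The file is rescanned from the top for every chunk; lines before START are skipped
    valid_line_index = 0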
    
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
                
            if valid_line_index >= END:
                break
                
            if valid_line_index >= START:
                try:
                    paper_data = json.loads(line)
                    papers_list.append(paper_data)
                except json.JSONDecodeError:
                    pass
            
            valid_line_index += 1

    final_urls = [f"https://arxiv.org/pdf/{paper['id']}.pdf" for paper in papers_list]
    texts_to_encode = [paper['abstract'] for paper in papers_list]
    try:
        all_embeddings_numpy = model.encode(
            texts_to_encode, 
            show_progress_bar=True,
            convert_to_numpy=True,
            batch_size=32
        )
    except NameError:
        print("\n*** ERROR: 'model' is not defined/loaded. Cannot encode. ***")
        break
        
    pbar.update(len(texts_to_encode))
    
    norms = np.linalg.norm(all_embeddings_numpy, axis=1, keepdims=True)
    all_embeddings_numpy = all_embeddings_numpy / norms

    output_path = f"PAPERS/Papers_Embedded_{x // chunk}.h5"
    with h5py.File(output_path, "w") as outfile:
        # URLs are stored as fixed-length byte strings, embeddings as a 2-D float array
        outfile.create_dataset("urls", data=np.array(final_urls, dtype='S'))
        outfile.create_dataset("embeddings", data=all_embeddings_numpy)
    
    del papers_list 

pbar.close()

# Merge the per-chunk HDF5 files
file_chunks = [
    f"PAPERS/Papers_Embedded_{x // chunk}.h5"
    for x in range(start, end, chunk)
]
file_gop_cuoi = f"PAPERS/Papers_Embedded_{start}-{end}.h5"
merge_HDF5_files(file_chunks, file_gop_cuoi)

# Delete the per-chunk files
for x in range(start, end, chunk):
    os.remove(f"PAPERS/Papers_Embedded_{x // chunk}.h5")