### **ArXiv Assistant Project - Dataset Preparation**

This notebook focuses on processing and preparing the arXiv dataset for use in the ArXiv Assistant project. It demonstrates the process of accessing arXiv papers from a Google Cloud Storage bucket, processing them into text chunks, and storing the resulting dataset on Hugging Face.

The notebook is designed to handle large-scale data processing efficiently, utilizing multiprocessing and cloud storage integration. It serves as an intermediate step in creating a high-quality dataset for training and fine-tuning language models on arXiv content.

The notebook includes the following key components:

1. **Setup and Configuration**:
   - Installs and imports the required libraries, including `google-cloud-storage`, `llama-index`, and supporting data processing tools.
   - Sets up authentication for Google Cloud and Hugging Face.
   - Configures environment variables and paths.

2. **Google Cloud Storage Integration**:
   - Functions to list and access files in the specified GCS bucket.
   - Exploration of the arXiv PDF files stored in the bucket.

3. **Data Processing Script Execution**:
   - Runs a Python script (`process_arxiv_data.py`) to process the arXiv PDFs.
   - Splits the extracted text into chunks; an embedding model is passed to the script via `--model_name`, potentially for semantic chunking.
   - Processes multiple monthly arXiv folders (e.g., `2209` through `2403`, i.e. September 2022 to March 2024).

4. **Dataset Creation and Upload**:
   - Creates a dataset from the processed chunks.
   - Uploads the resulting dataset to Hugging Face Datasets.

5. **Multiprocessing Utilization**:
   - Leverages multiprocessing to efficiently handle the large volume of data.

6. **Customization and Configuration**:
   - Allows for customization of processing parameters such as chunk size, number of files to process, and specific arXiv folders to include.

Together, these steps form the data preparation stage of the ArXiv Assistant pipeline, producing a large, well-structured dataset of arXiv papers for later model training and fine-tuning.

Author: Amr Sherif  
Created Date: 2024-06-13  
Updated Date: 2024-09-30  
Version: 1.0

In [1]:
from baseline.helpers import set_css

get_ipython().events.register('pre_run_cell', set_css)

In [2]:
!pip install --upgrade google-cloud-storage
!pip install llama-index
!pip install llama-index-embeddings-huggingface datasets PyMuPDF huggingface_hub transformers llama-index-embeddings-instructor
!pip install sentence-transformers
!pip install llama-index-embeddings-langchain
!pip install langchain langchain-community langchain-core
!pip install python-dotenv
!pip install llama-index-readers-gcs

In [4]:
from google.colab import drive
drive.mount('/content/drive')

In [5]:
%cd "DRIVE_PATH"

In [None]:
from google.colab import auth
auth.authenticate_user()

In [6]:
%ls

In [7]:
from google.colab import userdata
import os

os.environ['hf'] = userdata.get('hf')
service_account_key_path = userdata.get('service_account_key_path')

In [2]:
import multiprocessing

cores = multiprocessing.cpu_count()
cores
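How the processing script uses the available cores is not shown in this notebook. As a minimal sketch, assuming the work is a flat list of file names, one simple way to spread it across workers is to split the list into one batch per core. The helper `split_into_batches` and the `paper_*.pdf` names are hypothetical and exist only for illustration.

```python
import multiprocessing


def split_into_batches(items, num_batches):
    """Split items into num_batches lists using round-robin assignment.

    Hypothetical helper, not taken from process_arxiv_data.py.
    """
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    items = list(items)
    # Batch i receives items i, i + num_batches, i + 2 * num_batches, ...
    return [items[i::num_batches] for i in range(num_batches)]


cores = multiprocessing.cpu_count()

# Made-up file names standing in for the real blob names.
file_names = [f"paper_{i}.pdf" for i in range(10)]
batches = split_into_batches(file_names, cores)

# One entry per core; some batches are empty when there are more cores than files.
print([len(batch) for batch in batches])
```

Each batch could then be handed to a separate worker process, for example through `multiprocessing.Pool`.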

#### List and explore files in the bucket 

In [8]:
bucket_name = "arxiv-dataset"
prefix = "arxiv/arxiv/pdf"

In [None]:
from google.cloud import storage

def list_files_in_bucket(bucket_name, prefix, service_account_key_path):
    """Lazily yield the name of every blob under `prefix` in the bucket."""
    client = storage.Client.from_service_account_json(service_account_key_path)
    bucket = client.bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix)
    # Iterate page by page so the full listing is never held in memory.
    for page in blobs.pages:
        yield from (blob.name for blob in page)

In [None]:
# Calling the function only creates a generator; nothing is listed until it is iterated.
a = list_files_in_bucket(bucket_name, prefix, service_account_key_path)

In [None]:
# Print a running index alongside each blob name.
for i, file in enumerate(list_files_in_bucket(bucket_name, prefix, service_account_key_path)):
    print(i, file)
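Printing every entry gets noisy for a bucket of this size. As a rough sketch, the listing can instead be summarized as a file count per monthly folder. This assumes blob names follow a `<prefix>/<YYMM>/<file>.pdf` layout; the helper `count_files_per_folder` and the sample names below are hypothetical.

```python
from collections import Counter


def count_files_per_folder(blob_names, prefix):
    """Count blobs per first-level folder under prefix.

    Assumes names look like "<prefix>/<folder>/<file>".
    """
    counts = Counter()
    base = prefix.rstrip("/") + "/"
    for name in blob_names:
        if not name.startswith(base):
            continue
        folder, sep, _ = name[len(base):].partition("/")
        if sep:  # skip objects stored directly under the prefix
            counts[folder] += 1
    return counts


# Made-up names for illustration; in the notebook, pass the generator
# returned by list_files_in_bucket(...) instead.
sample_names = [
    "arxiv/arxiv/pdf/2403/2403.00001v1.pdf",
    "arxiv/arxiv/pdf/2403/2403.00002v1.pdf",
    "arxiv/arxiv/pdf/2402/2402.00001v2.pdf",
]
counts = count_files_per_folder(sample_names, "arxiv/arxiv/pdf")
print(counts.most_common())  # [('2403', 2), ('2402', 1)]
```

Because the function consumes any iterable, it works with the lazy generator without loading the whole listing into memory.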

#### Process arXiv data into text chunks and store them in a dataset on Hugging Face Datasets
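The internals of `process_arxiv_data.py` are not shown in this notebook. To make the idea of text chunking concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simpler stand-in and not the script's actual method, which may use embedding-based semantic chunking; the function name and its parameters are hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters. Illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The overlap keeps some shared context between neighbouring chunks, which can help when a sentence straddles a chunk boundary.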

In [9]:
!python process_arxiv_data.py --bucket_name $bucket_name --prefix $prefix --folders 2403 2402 2401 2312 2311 2310 2309 2308 2307 \
2306 2305 2304 2303 2302 2301 2212 2211 2210 2209  \
 --service_account_key_path $service_account_key_path \
 --model_name "WhereIsAI/UAE-Large-V1" --hf_username "amrachraf" \
 --hf_dataset_name "arXiv-full-text-chunked" --chunk_size_gb 0.1 \
 --local_path "TEMP_DIR_PATH" \
 --base_chunk_count 4 --num_files_limit 200 --max_files 400 --use_folder_limit True
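The folder list passed to `--folders` above is typed out by hand. As a small sketch, the same `YYMM` sequence could be generated programmatically; the helper `month_folders` is hypothetical and assumes two-digit years within the same century.

```python
def month_folders(start, end):
    """Return YYMM folder names from start to end inclusive, newest first."""
    start_year, start_month = int(start[:2]), int(start[2:])
    end_year, end_month = int(end[:2]), int(end[2:])
    if not (1 <= start_month <= 12 and 1 <= end_month <= 12):
        raise ValueError("month must be between 01 and 12")
    if (start_year, start_month) > (end_year, end_month):
        raise ValueError("start must not be later than end")

    folders = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        folders.append(f"{year:02d}{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return folders[::-1]


folders = month_folders("2209", "2403")
folder_args = " ".join(folders)
print(len(folders))  # 19
print(folder_args)
```

In the notebook, `folder_args` could then be passed to the command as `--folders $folder_args`, in the same way `$bucket_name` and `$prefix` are passed.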